linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH 0/5] Introduce FDBox, and preserve memfd with shmem over KHO
@ 2025-03-07  0:57 Pratyush Yadav
  2025-03-07  0:57 ` [RFC PATCH 1/5] misc: introduce FDBox Pratyush Yadav
                   ` (4 more replies)
  0 siblings, 5 replies; 29+ messages in thread
From: Pratyush Yadav @ 2025-03-07  0:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Pratyush Yadav, Jonathan Corbet, Eric Biederman, Arnd Bergmann,
	Greg Kroah-Hartman, Alexander Viro, Christian Brauner, Jan Kara,
	Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt,
	David Woodhouse, James Gowans, Mike Rapoport, Paolo Bonzini,
	Pasha Tatashin, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Jason Gunthorpe, Matthew Wilcox, Wei Yang, Andrew Morton,
	linux-fsdevel, linux-doc, linux-mm, kexec

This series introduces the File Descriptor Box (FDBox), along with
support in memfd and shmem for persisting memfds over KHO using FDBox.

FDBox is a mechanism for userspace to name file descriptors and give
them over to the kernel to hold. They can later be retrieved by passing
in the same name. The primary purpose of it is to be used with Kexec
Handover (KHO) [0]. See Documentation/kho/fdbox.rst for more details.

The original idea for FDBox came from Alex Graf. The main problem it
attempts to solve is to give a name to anonymous file descriptors like
memfd, guest_memfd, iommufd, etc. so they can be retrieved after KHO.
Alex wrote some initial code [1] which this series is based upon, though
that code is quite hacky and proof-of-concept, and has been
significantly refactored and polished with this series. Alex's code
mainly played around with KVM, but since I am more familiar with memfd
and shmem, I have picked those as the first users.

That is not to say this series is in a top notch state already. There is
still a lot of room for improvement, both in FDBox and in memfd and
shmem. The aim of the patches is to present the idea to get early
feedback, and to demonstrate KHO in action, potentially having a
consumer of KHO ready by the time those patches are ready for prime
time.

I have written a simple userspace tool to interact with FDBox and memfd.
It can be found here [2]. It is quite simple currently. When given the
create command, it creates a box, and a memfd, and fills the memfd with data
from a file called "data". It then adds the memfd to the box and seals
it. Then one can do KHO. After KHO, the restore command gets the FD out
of the box and writes the output to a file called "out". The original
and new file can be compared to ensure data consistency.

I have tested using the tool and a 1 GiB file, and the memfd came back
over KHO with the same contents. The performance was fast enough to not
be noticeable to the naked eye, though I have not done much more
performance analysis than that.

The whole process can be seen in action in this Asciinema [4].

Sample instructions to use the tool:

       $ make
       $ dd if=/dev/urandom of=data bs=1G count=1
       $ ./fdbox create
       $ echo 1 > /sys/kernel/kho/active
       $ kexec -s -l [...]
       $ kexec -e

After the kexec is done,

      $ ./fdbox restore
      $ cmp data out
      $ echo $?

The full tree with the patches can be found at [3]. It contains a couple
of my patches on top of Mike's KHO patches [0] to fix some small bugs.

[0] https://lore.kernel.org/lkml/20250206132754.2596694-1-rppt@kernel.org/
[1] https://github.com/agraf/linux-2.6/blob/kvm-kho-gmem-test/drivers/misc/fdbox.c
[2] https://github.com/prati0100/fdbox-utils/blob/main/fdbox.c
[3] https://web.git.kernel.org/pub/scm/linux/kernel/git/pratyush/linux.git/log/?h=kho
[4] https://asciinema.org/a/mnyVpy1w67mueIkKZzqHI0oAN

Pratyush Yadav (5):
  misc: introduce FDBox
  misc: add documentation for FDBox
  mm: shmem: allow callers to specify operations to shmem_undo_range
  mm: shmem: allow preserving file over FDBOX + KHO
  mm/memfd: allow preserving FD over FDBOX + KHO

 Documentation/filesystems/locking.rst |  21 +
 Documentation/kho/fdbox.rst           | 224 ++++++++
 Documentation/kho/index.rst           |   3 +
 MAINTAINERS                           |   9 +
 drivers/misc/Kconfig                  |   7 +
 drivers/misc/Makefile                 |   1 +
 drivers/misc/fdbox.c                  | 758 ++++++++++++++++++++++++++
 include/linux/fdbox.h                 | 119 ++++
 include/linux/fs.h                    |   7 +
 include/linux/miscdevice.h            |   1 +
 include/linux/shmem_fs.h              |   6 +
 include/uapi/linux/fdbox.h            |  61 +++
 mm/memfd.c                            | 128 ++++-
 mm/shmem.c                            | 498 +++++++++++++++--
 14 files changed, 1800 insertions(+), 43 deletions(-)
 create mode 100644 Documentation/kho/fdbox.rst
 create mode 100644 drivers/misc/fdbox.c
 create mode 100644 include/linux/fdbox.h
 create mode 100644 include/uapi/linux/fdbox.h

-- 
2.47.1


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [RFC PATCH 1/5] misc: introduce FDBox
  2025-03-07  0:57 [RFC PATCH 0/5] Introduce FDBox, and preserve memfd with shmem over KHO Pratyush Yadav
@ 2025-03-07  0:57 ` Pratyush Yadav
  2025-03-07  6:03   ` Greg Kroah-Hartman
  2025-03-07  9:31   ` Christian Brauner
  2025-03-07  0:57 ` [RFC PATCH 2/5] misc: add documentation for FDBox Pratyush Yadav
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 29+ messages in thread
From: Pratyush Yadav @ 2025-03-07  0:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Pratyush Yadav, Jonathan Corbet, Eric Biederman, Arnd Bergmann,
	Greg Kroah-Hartman, Alexander Viro, Christian Brauner, Jan Kara,
	Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt,
	David Woodhouse, James Gowans, Mike Rapoport, Paolo Bonzini,
	Pasha Tatashin, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Jason Gunthorpe, Matthew Wilcox, Wei Yang, Andrew Morton,
	linux-fsdevel, linux-doc, linux-mm, kexec

The File Descriptor Box (FDBox) is a mechanism for userspace to name
file descriptors and give them over to the kernel to hold. They can
later be retrieved by passing in the same name.

The primary purpose of FDBox is to be used with Kexec Handover (KHO).
There are many kinds anonymous file descriptors in the kernel like
memfd, guest_memfd, iommufd, etc. that would be useful to be preserved
using KHO. To be able to do that, there needs to be a mechanism to label
FDs that allows userspace to set the label before doing KHO and to use
the label to map them back after KHO. FDBox achieves that purpose by
exposing a miscdevice which exposes ioctls to label and transfer FDs
between the kernel and userspace. FDBox is not intended to work with any
generic file descriptor. Support for each kind of FDs must be explicitly
enabled.

While the primary purpose of FDBox is to be used with KHO, it does not
explicitly require CONFIG_KEXEC_HANDOVER, since it can be used without
KHO, simply as a way to preserve or transfer FDs when userspace exits.

Co-developed-by: Alexander Graf <graf@amazon.com>
Signed-off-by: Alexander Graf <graf@amazon.com>
Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
---

Notes:
    In a real live-update environment, it would likely make more sense to
    have a way of passing a hint to the kernel that KHO is about to happen
    and it should start preparing. Having as much state serialized as
    possible before the KHO freeze would help reduce downtime. An FDBox
    operation, say FDBOX_PREPARE_FD that can give the signal to prepare
    before actually being put in the box and sealed. I have not added
    something like that yet for simplicity sake.

 MAINTAINERS                |   8 +
 drivers/misc/Kconfig       |   7 +
 drivers/misc/Makefile      |   1 +
 drivers/misc/fdbox.c       | 758 +++++++++++++++++++++++++++++++++++++
 include/linux/fdbox.h      | 119 ++++++
 include/linux/fs.h         |   7 +
 include/linux/miscdevice.h |   1 +
 include/uapi/linux/fdbox.h |  61 +++
 8 files changed, 962 insertions(+)
 create mode 100644 drivers/misc/fdbox.c
 create mode 100644 include/linux/fdbox.h
 create mode 100644 include/uapi/linux/fdbox.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 82c2ef421c000..d329d3e5514c5 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8862,6 +8862,14 @@ F:	include/scsi/libfc.h
 F:	include/scsi/libfcoe.h
 F:	include/uapi/scsi/fc/
 
+FDBOX
+M:	Pratyush Yadav <pratyush@kernel.org>
+L:	linux-fsdevel@vger.kernel.org
+S:	Maintained
+F:	drivers/misc/fdbox.c
+F:	include/linux/fdbox.h
+F:	include/uapi/linux/fdbox.h
+
 FILE LOCKING (flock() and fcntl()/lockf())
 M:	Jeff Layton <jlayton@kernel.org>
 M:	Chuck Lever <chuck.lever@oracle.com>
diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
index 56bc72c7ce4a9..6fee70c9479c4 100644
--- a/drivers/misc/Kconfig
+++ b/drivers/misc/Kconfig
@@ -632,6 +632,13 @@ config MCHP_LAN966X_PCI
 	    - lan966x-miim (MDIO_MSCC_MIIM)
 	    - lan966x-switch (LAN966X_SWITCH)
 
+config FDBOX
+	bool "File Descriptor Box device to persist fds"
+	help
+	  Add a new /dev/fdbox directory that allows user space to preserve specific
+	  types of file descritors when user space exits. Also preserves the same
+	  types of file descriptors across kexec when KHO is enabled.
+
 source "drivers/misc/c2port/Kconfig"
 source "drivers/misc/eeprom/Kconfig"
 source "drivers/misc/cb710/Kconfig"
diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
index 545aad06d0885..59a398dcfcd64 100644
--- a/drivers/misc/Makefile
+++ b/drivers/misc/Makefile
@@ -75,3 +75,4 @@ lan966x-pci-objs		:= lan966x_pci.o
 lan966x-pci-objs		+= lan966x_pci.dtbo.o
 obj-$(CONFIG_MCHP_LAN966X_PCI)	+= lan966x-pci.o
 obj-y				+= keba/
+obj-$(CONFIG_FDBOX)		+= fdbox.o
diff --git a/drivers/misc/fdbox.c b/drivers/misc/fdbox.c
new file mode 100644
index 0000000000000..a8f6574e2c25f
--- /dev/null
+++ b/drivers/misc/fdbox.c
@@ -0,0 +1,758 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * fdbox.c - framework to preserve file descriptors across
+ *           process lifetime and kexec
+ *
+ * Copyright (C) 2024-2025 Amazon.com Inc. or its affiliates.
+ *
+ * Author: Pratyush Yadav <ptyadav@amazon.de>
+ * Author: Alexander Graf <graf@amazon.com>
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/device.h>
+#include <linux/anon_inodes.h>
+#include <linux/cdev.h>
+#include <linux/miscdevice.h>
+#include <linux/fdtable.h>
+#include <linux/file.h>
+#include <linux/kexec.h>
+#include <linux/kexec_handover.h>
+#include <linux/libfdt.h>
+#include <linux/fdbox.h>
+
+static struct miscdevice fdbox_dev;
+
+static struct {
+	struct class			*class;
+	dev_t				box_devt;
+	struct xarray			box_list;
+	struct xarray			handlers;
+	struct rw_semaphore		recover_sem;
+	bool				recover_done;
+} priv = {
+	.box_list = XARRAY_INIT(fdbox.box_list, XA_FLAGS_ALLOC),
+	.handlers = XARRAY_INIT(fdbox.handlers, XA_FLAGS_ALLOC),
+	.recover_sem = __RWSEM_INITIALIZER(priv.recover_sem),
+};
+
+struct fdbox_handler {
+	const char *compatible;
+	struct file *(*fn)(const void *fdt, int offset);
+};
+
+static struct fdbox *fdbox_remove_box(char *name)
+{
+	struct xarray *boxlist = &priv.box_list;
+	unsigned long box_idx;
+	struct fdbox *box;
+
+	xa_lock(boxlist);
+	xa_for_each(boxlist, box_idx, box) {
+		if (!strcmp(box->name, name)) {
+			__xa_erase(boxlist, box_idx);
+			break;
+		}
+	}
+	xa_unlock(boxlist);
+
+	return box;
+}
+
+static struct fdbox_fd *fdbox_remove_fd(struct fdbox *box, char *name)
+{
+	struct xarray *fdlist = &box->fd_list;
+	struct fdbox_fd *box_fd;
+	unsigned long idx;
+
+	xa_lock(fdlist);
+	xa_for_each(fdlist, idx, box_fd) {
+		if (!strncmp(box_fd->name, name, sizeof(box_fd->name))) {
+			__xa_erase(fdlist, idx);
+			break;
+		}
+	}
+	xa_unlock(fdlist);
+
+	return box_fd;
+}
+
+/* Must be called with box->rwsem held. */
+static struct fdbox_fd *fdbox_put_file(struct fdbox *box, const char *name,
+				       struct file *file)
+{
+	struct fdbox_fd *box_fd __free(kfree) = NULL, *cmp;
+	struct xarray *fdlist = &box->fd_list;
+	unsigned long idx;
+	u32 newid;
+	int ret;
+
+	/* Only files that set f_fdbox_op are allowed in the box. */
+	if (!file->f_fdbox_op)
+		return ERR_PTR(-EOPNOTSUPP);
+
+	box_fd = kzalloc(sizeof(*box_fd), GFP_KERNEL);
+	if (!box_fd)
+		return ERR_PTR(-ENOMEM);
+
+	if (strscpy_pad(box_fd->name, name, sizeof(box_fd->name)) < 0)
+		/* Name got truncated. This means the name is not NUL-terminated. */
+		return ERR_PTR(-EINVAL);
+
+	box_fd->file = file;
+	box_fd->box = box;
+
+	xa_lock(fdlist);
+	xa_for_each(fdlist, idx, cmp) {
+		/* Look for name collisions. */
+		if (!strcmp(box_fd->name, cmp->name)) {
+			xa_unlock(fdlist);
+			return ERR_PTR(-EEXIST);
+		}
+	}
+
+	ret = __xa_alloc(fdlist, &newid, box_fd, xa_limit_32b, GFP_KERNEL);
+	xa_unlock(fdlist);
+	if (ret)
+		return ERR_PTR(ret);
+
+	return_ptr(box_fd);
+}
+
+static long fdbox_put_fd(struct fdbox *box, unsigned long arg)
+{
+	struct fdbox_put_fd put_fd;
+	struct fdbox_fd *box_fd;
+	struct file *file;
+	int ret;
+
+	if (copy_from_user(&put_fd, (void __user *)arg, sizeof(put_fd)))
+		return -EFAULT;
+
+	guard(rwsem_read)(&box->rwsem);
+
+	if (box->sealed)
+		return -EBUSY;
+
+	file = fget_raw(put_fd.fd);
+	if (!file)
+		return -EINVAL;
+
+	box_fd = fdbox_put_file(box, put_fd.name, file);
+	if (IS_ERR(box_fd)) {
+		fput(file);
+		return PTR_ERR(box_fd);
+	}
+
+	ret = close_fd(put_fd.fd);
+	if (ret) {
+		struct fdbox_fd *del;
+
+		del = fdbox_remove_fd(box, put_fd.name);
+		/*
+		 * If we fail to remove from list, it means someone else took
+		 * the FD out. In that case, they own the refcount of the file
+		 * now.
+		 */
+		if (del == box_fd)
+			fput(file);
+
+		return ret;
+	}
+
+	return 0;
+}
+
+static long fdbox_seal(struct fdbox *box)
+{
+	struct fdbox_fd *box_fd;
+	unsigned long idx;
+	int ret;
+
+	guard(rwsem_write)(&box->rwsem);
+
+	if (box->sealed)
+		return -EBUSY;
+
+	xa_for_each(&box->fd_list, idx, box_fd) {
+		const struct fdbox_file_ops *fdbox_ops = box_fd->file->f_fdbox_op;
+
+		if (fdbox_ops && fdbox_ops->seal) {
+			ret = fdbox_ops->seal(box);
+			if (ret)
+				return ret;
+		}
+	}
+
+	box->sealed = true;
+
+	return 0;
+}
+
+static long fdbox_unseal(struct fdbox *box)
+{
+	struct fdbox_fd *box_fd;
+	unsigned long idx;
+	int ret;
+
+	guard(rwsem_write)(&box->rwsem);
+
+	if (!box->sealed)
+		return -EBUSY;
+
+	xa_for_each(&box->fd_list, idx, box_fd) {
+		const struct fdbox_file_ops *fdbox_ops = box_fd->file->f_fdbox_op;
+
+		if (fdbox_ops && fdbox_ops->seal) {
+			ret = fdbox_ops->seal(box);
+			if (ret)
+				return ret;
+		}
+	}
+
+	box->sealed = false;
+
+	return 0;
+}
+
+static long fdbox_get_fd(struct fdbox *box, unsigned long arg)
+{
+	struct fdbox_get_fd get_fd;
+	struct fdbox_fd *box_fd;
+	int fd;
+
+	guard(rwsem_read)(&box->rwsem);
+
+	if (box->sealed)
+		return -EBUSY;
+
+	if (copy_from_user(&get_fd, (void __user *)arg, sizeof(get_fd)))
+		return -EFAULT;
+
+	if (get_fd.flags)
+		return -EINVAL;
+
+	fd = get_unused_fd_flags(0);
+	if (fd < 0)
+		return fd;
+
+	box_fd = fdbox_remove_fd(box, get_fd.name);
+	if (!box_fd) {
+		put_unused_fd(fd);
+		return -ENOENT;
+	}
+
+	fd_install(fd, box_fd->file);
+	kfree(box_fd);
+	return fd;
+}
+
+static long box_fops_unl_ioctl(struct file *filep,
+			       unsigned int cmd, unsigned long arg)
+{
+	struct fdbox *box = filep->private_data;
+	long ret = -EINVAL;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	switch (cmd) {
+	case FDBOX_PUT_FD:
+		ret = fdbox_put_fd(box, arg);
+		break;
+	case FDBOX_UNSEAL:
+		ret = fdbox_unseal(box);
+		break;
+	case FDBOX_SEAL:
+		ret = fdbox_seal(box);
+		break;
+	case FDBOX_GET_FD:
+		ret = fdbox_get_fd(box, arg);
+		break;
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+static int box_fops_open(struct inode *inode, struct file *filep)
+{
+	struct fdbox *box = container_of(inode->i_cdev, struct fdbox, cdev);
+
+	filep->private_data = box;
+
+	return 0;
+}
+
+static const struct file_operations box_fops = {
+	.owner		= THIS_MODULE,
+	.unlocked_ioctl	= box_fops_unl_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
+	.open		= box_fops_open,
+};
+
+static void fdbox_device_release(struct device *dev)
+{
+	struct fdbox *box = container_of(dev, struct fdbox, dev);
+	struct xarray *fdlist = &box->fd_list;
+	struct fdbox_fd *box_fd;
+	unsigned long idx;
+
+	unregister_chrdev_region(box->dev.devt, 1);
+
+	xa_for_each(fdlist, idx, box_fd) {
+		xa_erase(fdlist, idx);
+		fput(box_fd->file);
+		kfree(box_fd);
+	}
+
+	xa_destroy(fdlist);
+	kfree(box);
+}
+
+static struct fdbox *_fdbox_create_box(const char *name)
+{
+	struct fdbox *box;
+	int ret = 0;
+	u32 id;
+
+	box = kzalloc(sizeof(*box), GFP_KERNEL);
+	if (!box)
+		return ERR_PTR(-ENOMEM);
+
+	xa_init_flags(&box->fd_list, XA_FLAGS_ALLOC);
+	xa_init_flags(&box->pending_fds, XA_FLAGS_ALLOC);
+	init_rwsem(&box->rwsem);
+
+	if (strscpy_pad(box->name, name, sizeof(box->name)) < 0) {
+		/* Name got truncated. This means the name is not NUL-terminated. */
+		kfree(box);
+		return ERR_PTR(-EINVAL);
+	}
+
+	dev_set_name(&box->dev, "fdbox/%s", name);
+
+	ret = alloc_chrdev_region(&box->dev.devt, 0, 1, name);
+	if (ret) {
+		kfree(box);
+		return ERR_PTR(ret);
+	}
+
+	box->dev.release = fdbox_device_release;
+	device_initialize(&box->dev);
+
+	cdev_init(&box->cdev, &box_fops);
+	box->cdev.owner = THIS_MODULE;
+	kobject_set_name(&box->cdev.kobj, "fdbox/%s", name);
+
+	ret = cdev_device_add(&box->cdev, &box->dev);
+	if (ret)
+		goto err_dev;
+
+	ret = xa_alloc(&priv.box_list, &id, box, xa_limit_32b, GFP_KERNEL);
+	if (ret)
+		goto err_cdev;
+
+	return box;
+
+err_cdev:
+	cdev_device_del(&box->cdev, &box->dev);
+err_dev:
+	/*
+	 * This should free the box and chrdev region via
+	 * fdbox_device_release().
+	 */
+	put_device(&box->dev);
+
+	return ERR_PTR(ret);
+}
+
+static long fdbox_create_box(unsigned long arg)
+{
+	struct fdbox_create_box create_box;
+
+	if (copy_from_user(&create_box, (void __user *)arg, sizeof(create_box)))
+		return -EFAULT;
+
+	if (create_box.flags)
+		return -EINVAL;
+
+	return PTR_ERR_OR_ZERO(_fdbox_create_box(create_box.name));
+}
+
+static void _fdbox_delete_box(struct fdbox *box)
+{
+	cdev_device_del(&box->cdev, &box->dev);
+	unregister_chrdev_region(box->dev.devt, 1);
+	put_device(&box->dev);
+}
+
+static long fdbox_delete_box(unsigned long arg)
+{
+	struct fdbox_delete_box delete_box;
+	struct fdbox *box;
+
+	if (copy_from_user(&delete_box, (void __user *)arg, sizeof(delete_box)))
+		return -EFAULT;
+
+	if (delete_box.flags)
+		return -EINVAL;
+
+	box = fdbox_remove_box(delete_box.name);
+	if (!box)
+		return -ENOENT;
+
+	_fdbox_delete_box(box);
+	return 0;
+}
+
+static long fdbox_fops_unl_ioctl(struct file *filep,
+				 unsigned int cmd, unsigned long arg)
+{
+	long ret = -EINVAL;
+
+	switch (cmd) {
+	case FDBOX_CREATE_BOX:
+		ret = fdbox_create_box(arg);
+		break;
+	case FDBOX_DELETE_BOX:
+		ret = fdbox_delete_box(arg);
+		break;
+	}
+
+	return ret;
+}
+
+static const struct file_operations fdbox_fops = {
+	.owner		= THIS_MODULE,
+	.unlocked_ioctl	= fdbox_fops_unl_ioctl,
+	.compat_ioctl	= compat_ptr_ioctl,
+};
+
+static struct miscdevice fdbox_dev = {
+	.minor = FDBOX_MINOR,
+	.name = "fdbox",
+	.fops = &fdbox_fops,
+	.nodename = "fdbox/fdbox",
+	.mode = 0600,
+};
+
+static char *fdbox_devnode(const struct device *dev, umode_t *mode)
+{
+	char *ret = kasprintf(GFP_KERNEL, "fdbox/%s", dev_name(dev));
+	return ret;
+}
+
+static int fdbox_kho_write_fds(void *fdt, struct fdbox *box)
+{
+	struct fdbox_fd *box_fd;
+	struct file *file;
+	unsigned long idx;
+	int err = 0;
+
+	xa_for_each(&box->fd_list, idx, box_fd) {
+		file = box_fd->file;
+
+		if (!file->f_fdbox_op->kho_write) {
+			pr_info("box '%s' FD '%s' has no KHO method. It won't be saved across kexec\n",
+				box->name, box_fd->name);
+			continue;
+		}
+
+		err = fdt_begin_node(fdt, box_fd->name);
+		if (err) {
+			pr_err("failed to begin node for box '%s' FD '%s'\n",
+			       box->name, box_fd->name);
+			return err;
+		}
+
+		inode_lock(file_inode(file));
+		err = file->f_fdbox_op->kho_write(box_fd, fdt);
+		inode_unlock(file_inode(file));
+		if (err) {
+			pr_err("kho_write failed for box '%s' FD '%s': %d\n",
+			       box->name, box_fd->name, err);
+			return err;
+		}
+
+		err = fdt_end_node(fdt);
+		if (err) {
+			/* TODO: This leaks all pages reserved by kho_write(). */
+			pr_err("failed to end node for box '%s' FD '%s'\n",
+			       box->name, box_fd->name);
+			return err;
+		}
+	}
+
+	return err;
+}
+
+static int fdbox_kho_write_boxes(void *fdt)
+{
+	static const char compatible[] = "fdbox,box-v1";
+	struct fdbox *box;
+	unsigned long idx;
+	int err = 0;
+
+	xa_for_each(&priv.box_list, idx, box) {
+		if (!box->sealed)
+			continue;
+
+		err |= fdt_begin_node(fdt, box->name);
+		err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
+		err |= fdbox_kho_write_fds(fdt, box);
+		err |= fdt_end_node(fdt);
+	}
+
+	return err;
+}
+
+static int fdbox_kho_notifier(struct notifier_block *self,
+			      unsigned long cmd,
+			      void *v)
+{
+	static const char compatible[] = "fdbox-v1";
+	void *fdt = v;
+	int err = 0;
+
+	switch (cmd) {
+	case KEXEC_KHO_ABORT:
+		return NOTIFY_DONE;
+	case KEXEC_KHO_DUMP:
+		/* Handled below */
+		break;
+	default:
+		return NOTIFY_BAD;
+	}
+
+	err |= fdt_begin_node(fdt, "fdbox");
+	err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
+	err |= fdbox_kho_write_boxes(fdt);
+	err |= fdt_end_node(fdt);
+
+	return err ? NOTIFY_BAD : NOTIFY_DONE;
+}
+
+static struct notifier_block fdbox_kho_nb = {
+	.notifier_call = fdbox_kho_notifier,
+};
+
+static void fdbox_recover_fd(const void *fdt, int offset, struct fdbox *box,
+			     struct file *(*fn)(const void *fdt, int offset))
+{
+	struct fdbox_fd *box_fd;
+	struct file *file;
+	const char *name;
+
+	name = fdt_get_name(fdt, offset, NULL);
+	if (!name) {
+		pr_err("no name in FDT for FD at offset %d\n", offset);
+		return;
+	}
+
+	file = fn(fdt, offset);
+	if (!file)
+		return;
+
+	scoped_guard(rwsem_read, &box->rwsem) {
+		box_fd = fdbox_put_file(box, name, file);
+		if (IS_ERR(box_fd)) {
+			pr_err("failed to put fd '%s' into box '%s': %ld\n",
+			       box->name, name, PTR_ERR(box_fd));
+			fput(file);
+			return;
+		}
+	}
+}
+
+static void fdbox_kho_recover(void)
+{
+	const void *fdt = kho_get_fdt();
+	const char *path = "/fdbox";
+	int off, box, fd;
+	int err;
+
+	/* Not a KHO boot */
+	if (!fdt)
+		return;
+
+	/*
+	 * When adding handlers this is taken as read. Taking it as write here
+	 * ensures no handlers get added while nodes are being processed,
+	 * eliminating the race of a handler getting added after its node is
+	 * processed, but before the whole recover is done.
+	 */
+	guard(rwsem_write)(&priv.recover_sem);
+
+	off = fdt_path_offset(fdt, path);
+	if (off < 0) {
+		pr_debug("could not find '%s' in DT", path);
+		return;
+	}
+
+	err = fdt_node_check_compatible(fdt, off, "fdbox-v1");
+	if (err) {
+		pr_err("invalid top level compatible\n");
+		return;
+	}
+
+	fdt_for_each_subnode(box, fdt, off) {
+		struct fdbox *new_box;
+
+		err = fdt_node_check_compatible(fdt, box, "fdbox,box-v1");
+		if (err) {
+			pr_err("invalid compatible for box '%s'\n",
+			       fdt_get_name(fdt, box, NULL));
+			continue;
+		}
+
+		new_box = _fdbox_create_box(fdt_get_name(fdt, box, NULL));
+		if (IS_ERR(new_box)) {
+			pr_warn("could not create box '%s'\n",
+				fdt_get_name(fdt, box, NULL));
+			continue;
+		}
+
+		fdt_for_each_subnode(fd, fdt, box) {
+			struct fdbox_handler *handler;
+			const char *compatible;
+			unsigned long idx;
+
+			compatible = fdt_getprop(fdt, fd, "compatible", NULL);
+			if (!compatible) {
+				pr_warn("failed to get compatible for FD '%s'. Skipping.\n",
+					fdt_get_name(fdt, fd, NULL));
+				continue;
+			}
+
+			xa_for_each(&priv.handlers, idx, handler) {
+				if (!strcmp(handler->compatible, compatible))
+					break;
+			}
+
+			if (handler) {
+				fdbox_recover_fd(fdt, fd, new_box, handler->fn);
+			} else {
+				u32 id;
+
+				pr_debug("found no handler for compatible %s. Queueing for later.\n",
+					 compatible);
+
+				if (xa_alloc(&new_box->pending_fds, &id,
+					     xa_mk_value(fd), xa_limit_32b,
+					     GFP_KERNEL)) {
+					pr_warn("failed to queue pending FD '%s' to list\n",
+						fdt_get_name(fdt, fd, NULL));
+				}
+			}
+		}
+
+		new_box->sealed = true;
+	}
+
+	priv.recover_done = true;
+}
+
+static void fdbox_recover_pending(struct fdbox_handler *handler)
+{
+	const void *fdt = kho_get_fdt();
+	unsigned long bid, pid;
+	struct fdbox *box;
+	void *pending;
+
+	if (WARN_ON(!fdt))
+		return;
+
+	xa_for_each(&priv.box_list, bid, box) {
+		xa_for_each(&box->pending_fds, pid, pending) {
+			int off = xa_to_value(pending);
+
+			if (fdt_node_check_compatible(fdt, off, handler->compatible) == 0) {
+				fdbox_recover_fd(fdt, off, box, handler->fn);
+				xa_erase(&box->pending_fds, pid);
+			}
+		}
+	}
+}
+
+int fdbox_register_handler(const char *compatible,
+			   struct file *(*fn)(const void *fdt, int offset))
+{
+	struct xarray *handlers = &priv.handlers;
+	struct fdbox_handler *handler, *cmp;
+	unsigned long idx;
+	int ret;
+	u32 id;
+
+	/* See comment in fdbox_kho_recover(). */
+	guard(rwsem_read)(&priv.recover_sem);
+
+	handler = kmalloc(sizeof(*handler), GFP_KERNEL);
+	if (!handler)
+		return -ENOMEM;
+
+	handler->compatible = compatible;
+	handler->fn = fn;
+
+	xa_lock(handlers);
+	xa_for_each(handlers, idx, cmp) {
+		if (!strcmp(cmp->compatible, compatible)) {
+			xa_unlock(handlers);
+			kfree(handler);
+			return -EEXIST;
+		}
+	}
+
+	ret = __xa_alloc(handlers, &id, handler, xa_limit_32b, GFP_KERNEL);
+	xa_unlock(handlers);
+	if (ret) {
+		kfree(handler);
+		return ret;
+	}
+
+	if (priv.recover_done)
+		fdbox_recover_pending(handler);
+
+	return 0;
+}
+
+static int __init fdbox_init(void)
+{
+	int ret = 0;
+
+	/* /dev/fdbox/$NAME */
+	priv.class = class_create("fdbox");
+	if (IS_ERR(priv.class))
+		return PTR_ERR(priv.class);
+
+	priv.class->devnode = fdbox_devnode;
+
+	ret = alloc_chrdev_region(&priv.box_devt, 0, 1, "fdbox");
+	if (ret)
+		goto err_class;
+
+	ret = misc_register(&fdbox_dev);
+	if (ret) {
+		pr_err("fdbox: misc device register failed\n");
+		goto err_chrdev;
+	}
+
+	if (IS_ENABLED(CONFIG_KEXEC_HANDOVER)) {
+		register_kho_notifier(&fdbox_kho_nb);
+		fdbox_kho_recover();
+	}
+
+	return 0;
+
+err_chrdev:
+	unregister_chrdev_region(priv.box_devt, 1);
+	priv.box_devt = 0;
+err_class:
+	class_destroy(priv.class);
+	priv.class = NULL;
+	return ret;
+}
+module_init(fdbox_init);
diff --git a/include/linux/fdbox.h b/include/linux/fdbox.h
new file mode 100644
index 0000000000000..0bc18742940f5
--- /dev/null
+++ b/include/linux/fdbox.h
@@ -0,0 +1,119 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Copyright (C) 2024-2025 Amazon.com Inc. or its affiliates.
+ *
+ * Author: Pratyush Yadav <ptyadav@amazon.de>
+ * Author: Alexander Graf <graf@amazon.com>
+ */
+#ifndef _LINUX_FDBOX_H
+#define _LINUX_FDBOX_H
+
+#include <linux/cdev.h>
+#include <linux/device.h>
+#include <linux/fs.h>
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/types.h>
+#include <uapi/linux/fdbox.h>
+
+/**
+ * struct fdbox - A box of FDs.
+ * @name: Name of the box. Must be unique.
+ * @rwsem: Used to ensure exclusive access to the box during SEAL/UNSEAL
+ *         operations.
+ * @dev: Backing device for the character device.
+ * @cdev: Character device which accepts ioctls from userspace.
+ * @fd_list: List of FDs in the box.
+ * @sealed: Whether the box is sealed or not.
+ */
+struct fdbox {
+	char				name[FDBOX_NAME_LEN];
+	/*
+	 * Taken as read when non-exclusive access is needed and the box can be
+	 * in mutable state. For example, the GET_FD and PUT_FD operations use
+	 * it as read when adding or removing FDs from the box.
+	 *
+	 * Taken as write when exclusive access is needed and the box should be
+	 * in a stable, non-mutable state. For example, the SEAL and UNSEAL
+	 * operations use it as write because they need the list of FDs to be
+	 * stable.
+	 */
+	struct rw_semaphore		rwsem;
+	struct device			dev;
+	struct cdev			cdev;
+	struct xarray			fd_list;
+	struct xarray			pending_fds;
+	bool				sealed;
+};
+
+/**
+ * struct fdbox_fd - An FD in a box.
+ * @name: Name of the FD. Must be unique in the box.
+ * @file: Underlying file for the FD.
+ * @flags: Box flags. Currently, no flags are allowed.
+ * @box: The box to which this FD belongs.
+ */
+struct fdbox_fd {
+	char				name[FDBOX_NAME_LEN];
+	struct file			*file;
+	int				flags;
+	struct fdbox			*box;
+};
+
+/**
+ * struct fdbox_file_ops - operations for files that can be put into a fdbox.
+ */
+struct fdbox_file_ops {
+	/**
+	 * @kho_write: write fd to KHO FDT.
+	 *
+	 * box_fd: Box FD to be serialized.
+	 *
+	 * fdt: KHO FDT
+	 *
+	 * This is called during KHO activation phase to serialize all data
+	 * needed for a FD to be preserved across a KHO.
+	 *
+	 * Returns: 0 on success, -errno on failure. Error here causes KHO
+	 * activation failure.
+	 */
+	int (*kho_write)(struct fdbox_fd *box_fd, void *fdt);
+	/**
+	 * @seal: seal the box
+	 *
+	 * box: Box which is going to be sealed.
+	 *
+	 * This can be set if a file has a dependency on other files. At seal
+	 * time, all the FDs in the box can be inspected to ensure all the
+	 * dependencies are met.
+	 */
+	int (*seal)(struct fdbox *box);
+	/**
+	 * @unseal: unseal the box
+	 *
+	 * box: Box which is going to be sealed.
+	 *
+	 * The opposite of seal. This can be set if a file has a dependency on
+	 * other files. At unseal time, all the FDs in the box can be inspected
+	 * to ensure all the dependencies are met. This can help ensure all
+	 * necessary FDs made it through after a KHO for example.
+	 */
+	int (*unseal)(struct fdbox *box);
+};
+
+/**
+ * fdbox_register_handler - register a handler for recovering Box FDs after KHO.
+ * @compatible: compatible string in the KHO FDT node.
+ * @handler: function to parse the FDT at offset 'offset'.
+ *
+ * After KHO, the FDs in the KHO FDT must be deserialized by the underlying
+ * modules or file systems. Since module initialization can be in any order,
+ * including after FDBox has been initialized, handler registration allows
+ * modules to queue their parsing functions, and FDBox will execute them when it
+ * can.
+ *
+ * Returns: 0 on success, -errno otherwise.
+ */
+int fdbox_register_handler(const char *compatible,
+			   struct file *(*handler)(const void *fdt, int offset));
+#endif /* _LINUX_FDBOX_H */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index be3ad155ec9f7..7d710a5e09b5b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -81,6 +81,9 @@ struct fs_context;
 struct fs_parameter_spec;
 struct fileattr;
 struct iomap_ops;
+struct fdbox;
+struct fdbox_fd;
+struct fdbox_file_ops;
 
 extern void __init inode_init(void);
 extern void __init inode_init_early(void);
@@ -1078,6 +1081,7 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
  * @f_llist: work queue entrypoint
  * @f_ra: file's readahead state
  * @f_freeptr: Pointer used by SLAB_TYPESAFE_BY_RCU file cache (don't touch.)
+ * @f_fdbox_op: FDBOX operations
  */
 struct file {
 	file_ref_t			f_ref;
@@ -1116,6 +1120,9 @@ struct file {
 		freeptr_t		f_freeptr;
 	};
 	/* --- cacheline 3 boundary (192 bytes) --- */
+#ifdef CONFIG_FDBOX
+	const struct fdbox_file_ops	*f_fdbox_op;
+#endif
 } __randomize_layout
   __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */
 
diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
index 69e110c2b86a9..fedb873c04453 100644
--- a/include/linux/miscdevice.h
+++ b/include/linux/miscdevice.h
@@ -71,6 +71,7 @@
 #define USERIO_MINOR		240
 #define VHOST_VSOCK_MINOR	241
 #define RFKILL_MINOR		242
+#define FDBOX_MINOR		243
 #define MISC_DYNAMIC_MINOR	255
 
 struct device;
diff --git a/include/uapi/linux/fdbox.h b/include/uapi/linux/fdbox.h
new file mode 100644
index 0000000000000..577ba33b908fd
--- /dev/null
+++ b/include/uapi/linux/fdbox.h
@@ -0,0 +1,61 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * This file contains definitions and structures for fdbox ioctls.
+ *
+ * Copyright (C) 2024-2025 Amazon.com Inc. or its affiliates.
+ *
+ * Author: Pratyush Yadav <ptyadav@amazon.de>
+ * Author: Alexander Graf <graf@amazon.com>
+ */
+#ifndef _UAPI_LINUX_FDBOX_H
+#define _UAPI_LINUX_FDBOX_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define FDBOX_NAME_LEN			256
+
+#define FDBOX_TYPE	('.')
+#define FDBOX_BASE	0
+
+/* Ioctls on /dev/fdbox/fdbox */
+
+/* Create a box. */
+#define FDBOX_CREATE_BOX	_IO(FDBOX_TYPE, FDBOX_BASE + 0)
+struct fdbox_create_box {
+	__u64 flags;
+	__u8 name[FDBOX_NAME_LEN];
+};
+
+/* Delete a box. */
+#define FDBOX_DELETE_BOX	_IO(FDBOX_TYPE, FDBOX_BASE + 1)
+struct fdbox_delete_box {
+	__u64 flags;
+	__u8 name[FDBOX_NAME_LEN];
+};
+
+/* Ioctls on /dev/fdbox/$BOXNAME */
+
+/* Put FD into box. This unmaps the FD from the calling process. */
+#define FDBOX_PUT_FD	_IO(FDBOX_TYPE, FDBOX_BASE + 2)
+struct fdbox_put_fd {
+	__u64 flags;
+	__u32 fd;
+	__u32 pad;
+	__u8 name[FDBOX_NAME_LEN];
+};
+
+/* Get the FD from box. This maps the FD into the calling process. */
+#define FDBOX_GET_FD	_IO(FDBOX_TYPE, FDBOX_BASE + 3)
+struct fdbox_get_fd {
+	__u64 flags;
+	__u32 pad;
+	__u8 name[FDBOX_NAME_LEN];
+};
+
+/* Seal the box. After this, no FDs can be put in or taken out of the box. */
+#define FDBOX_SEAL	_IO(FDBOX_TYPE, FDBOX_BASE + 4)
+/* Unseal the box. Opposite of seal. */
+#define FDBOX_UNSEAL	_IO(FDBOX_TYPE, FDBOX_BASE + 5)
+
+#endif /* _UAPI_LINUX_FDBOX_H */
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 2/5] misc: add documentation for FDBox
  2025-03-07  0:57 [RFC PATCH 0/5] Introduce FDBox, and preserve memfd with shmem over KHO Pratyush Yadav
  2025-03-07  0:57 ` [RFC PATCH 1/5] misc: introduce FDBox Pratyush Yadav
@ 2025-03-07  0:57 ` Pratyush Yadav
  2025-03-07  2:19   ` Randy Dunlap
  2025-03-07 14:22   ` Jonathan Corbet
  2025-03-07  0:57 ` [RFC PATCH 3/5] mm: shmem: allow callers to specify operations to shmem_undo_range Pratyush Yadav
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 29+ messages in thread
From: Pratyush Yadav @ 2025-03-07  0:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Pratyush Yadav, Jonathan Corbet, Eric Biederman, Arnd Bergmann,
	Greg Kroah-Hartman, Alexander Viro, Christian Brauner, Jan Kara,
	Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt,
	David Woodhouse, James Gowans, Mike Rapoport, Paolo Bonzini,
	Pasha Tatashin, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Jason Gunthorpe, Matthew Wilcox, Wei Yang, Andrew Morton,
	linux-fsdevel, linux-doc, linux-mm, kexec

With FDBox in place, add documentation that describes what it is and how
it is used, along with its UAPI and in-kernel API.

Since the document refers to KHO, add a reference tag in kho/index.rst.

Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
---
 Documentation/filesystems/locking.rst |  21 +++
 Documentation/kho/fdbox.rst           | 224 ++++++++++++++++++++++++++
 Documentation/kho/index.rst           |   3 +
 MAINTAINERS                           |   1 +
 4 files changed, 249 insertions(+)
 create mode 100644 Documentation/kho/fdbox.rst

diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index d20a32b77b60f..5526833faf79a 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -607,6 +607,27 @@ used. To block changes to file contents via a memory mapping during the
 operation, the filesystem must take mapping->invalidate_lock to coordinate
 with ->page_mkwrite.
 
+fdbox_file_ops
+==============
+
+prototypes::
+
+	int (*kho_write)(struct fdbox_fd *box_fd, void *fdt);
+	int (*seal)(struct fdbox *box);
+	int (*unseal)(struct fdbox *box);
+
+
+locking rules:
+	all may block
+
+==============	==================================================
+ops		i_rwsem(box_fd->file->f_inode)
+==============	==================================================
+kho_write:	exclusive
+seal:		no
+unseal:		no
+==============	==================================================
+
 dquot_operations
 ================
 
diff --git a/Documentation/kho/fdbox.rst b/Documentation/kho/fdbox.rst
new file mode 100644
index 0000000000000..44a3f5cdf1efb
--- /dev/null
+++ b/Documentation/kho/fdbox.rst
@@ -0,0 +1,224 @@
+.. SPDX-License-Identifier: GPL-2.0-or-later
+
+===========================
+File Descriptor Box (FDBox)
+===========================
+
+:Author: Pratyush Yadav
+
+Introduction
+============
+
+The File Descriptor Box (FDBox) is a mechanism for userspace to name file
+descriptors and give them over to the kernel to hold. They can later be
+retrieved by passing in the same name.
+
+The primary purpose of FDBox is to be used with :ref:`kho`. There are many kinds
+anonymous file descriptors in the kernel like memfd, guest_memfd, iommufd, etc.
+that would be useful to be preserved using KHO. To be able to do that, there
+needs to be a mechanism to label FDs that allows userspace to set the label
+before doing KHO and to use the label to map them back after KHO. FDBox achieves
+that purpose by exposing a miscdevice which exposes ioctls to label and transfer
+FDs between the kernel and userspace. FDBox is not intended to work with any
+generic file descriptor. Support for each kind of FDs must be explicitly
+enabled.
+
+FDBox can be enabled by setting the ``CONFIG_FDBOX`` option to ``y``. While the
+primary purpose of FDBox is to be used with KHO, it does not explicitly require
+``CONFIG_KEXEC_HANDOVER``, since it can be used without KHO, simply as a way to
+preserve or transfer FDs when userspace exits.
+
+Concepts
+========
+
+Box
+---
+
+The box is a container for FDs. Boxes are identified by their name, which must
+be unique. Userspace can put FDs in the box using the ``FDBOX_PUT_FD``
+operation, and take them out of the box using the ``FDBOX_GET_FD`` operation.
+Once all the required FDs are put into the box, it can be sealed to make it
+ready for shipping. This can be done by the ``FDBOX_SEAL`` operation. The seal
+operation notifies each FD in the box. If any of the FDs have a dependency on
+another, this gives them an opportunity to ensure all dependencies are met, or
+fail the seal if not. Once a box is sealed, no FDs can be added or removed from
+the box until it is unsealed. Only sealed boxes are transported to a new kernel
+via KHO. The box can be unsealed by the ``FDBOX_UNSEAL`` operation. This is the
+opposite of seal. It also notifies each FD in the box to ensure all dependencies
+are met. This can be useful in case some FDs fail to be restored after KHO.
+
+Box FD
+------
+
+The Box FD is a FD that is currently in a box. It is identified by its name,
+which must be unique in the box it belongs to. The Box FD is created when a FD
+is put into a box by using the ``FDBOX_PUT_FD`` operation. This operation
+removes the FD from the calling task. The FD can be restored by passing the
+unique name to the ``FDBOX_GET_FD`` operation.
+
+FDBox control device
+--------------------
+
+This is the ``/dev/fdbox/fdbox`` device. A box can be created using the
+``FDBOX_CREATE_BOX`` operation on the device. A box can be removed using the
+``FDBOX_DELETE_BOX`` operation.
+
+UAPI
+====
+
+FDBOX_NAME_LEN
+--------------
+
+.. code-block:: c
+
+    #define FDBOX_NAME_LEN			256
+
+Maximum length of the name of a Box or Box FD.
+
+Ioctls on /dev/fdbox/fdbox
+--------------------------
+
+FDBOX_CREATE_BOX
+~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+    #define FDBOX_CREATE_BOX	_IO(FDBOX_TYPE, FDBOX_BASE + 0)
+    struct fdbox_create_box {
+    	__u64 flags;
+    	__u8 name[FDBOX_NAME_LEN];
+    };
+
+Create a box.
+
+After this returns, the box is available at ``/dev/fdbox/<name>``.
+
+``name``
+    The name of the box to be created. Must be unique.
+
+``flags``
+    Flags to the operation. Currently, no flags are defined.
+
+Returns:
+    0 on success, -1 on error, with errno set.
+
+FDBOX_DELETE_BOX
+~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+    #define FDBOX_DELETE_BOX	_IO(FDBOX_TYPE, FDBOX_BASE + 1)
+    struct fdbox_delete_box {
+    	__u64 flags;
+    	__u8 name[FDBOX_NAME_LEN];
+    };
+
+Delete a box.
+
+After this returns, the box is no longer available at ``/dev/fdbox/<name>``.
+
+``name``
+    The name of the box to be deleted.
+
+``flags``
+    Flags to the operation. Currently, no flags are defined.
+
+Returns:
+    0 on success, -1 on error, with errno set.
+
+Ioctls on /dev/fdbox/<boxname>
+------------------------------
+
+These must be performed on the ``/dev/fdbox/<boxname>`` device.
+
+FDBX_PUT_FD
+~~~~~~~~~~~
+
+.. code-block:: c
+
+    #define FDBOX_PUT_FD	_IO(FDBOX_TYPE, FDBOX_BASE + 2)
+    struct fdbox_put_fd {
+    	__u64 flags;
+    	__u32 fd;
+    	__u32 pad;
+    	__u8 name[FDBOX_NAME_LEN];
+    };
+
+
+Put FD into the box.
+
+After this returns, ``fd`` is removed from the task and can no longer be used by
+it.
+
+``name``
+    The name of the FD.
+
+``fd``
+    The file descriptor number to be
+
+``flags``
+    Flags to the operation. Currently, no flags are defined.
+
+Returns:
+    0 on success, -1 on error, with errno set.
+
+FDBX_GET_FD
+~~~~~~~~~~~
+
+.. code-block:: c
+
+    #define FDBOX_GET_FD	_IO(FDBOX_TYPE, FDBOX_BASE + 3)
+    struct fdbox_get_fd {
+    	__u64 flags;
+    	__u8 name[FDBOX_NAME_LEN];
+    };
+
+Get an FD from the box.
+
+After this returns, the FD identified by ``name`` is mapped into the task and is
+available for use.
+
+``name``
+    The name of the FD to get.
+
+``flags``
+    Flags to the operation. Currently, no flags are defined.
+
+Returns:
+    FD number on success, -1 on error with errno set.
+
+FDBOX_SEAL
+~~~~~~~~~~
+
+.. code-block:: c
+
+    #define FDBOX_SEAL	_IO(FDBOX_TYPE, FDBOX_BASE + 4)
+
+Seal the box.
+
+Gives the kernel an opportunity to ensure all dependencies are met in the box.
+After this returns, the box is sealed and FDs can no longer be added or removed
+from it. A box must be sealed for it to be transported across KHO.
+
+Returns:
+    0 on success, -1 on error with errno set.
+
+FDBOX_UNSEAL
+~~~~~~~~~~~~
+
+.. code-block:: c
+
+    #define FDBOX_UNSEAL	_IO(FDBOX_TYPE, FDBOX_BASE + 5)
+
+Unseal the box.
+
+Gives the kernel an opportunity to ensure all dependencies are met in the box,
+and in case of KHO, no FDs have been lost in transit.
+
+Returns:
+    0 on success, -1 on error with errno set.
+
+Kernel functions and structures
+===============================
+
+.. kernel-doc:: include/linux/fdbox.h
diff --git a/Documentation/kho/index.rst b/Documentation/kho/index.rst
index 5e7eeeca8520f..051513b956075 100644
--- a/Documentation/kho/index.rst
+++ b/Documentation/kho/index.rst
@@ -1,5 +1,7 @@
 .. SPDX-License-Identifier: GPL-2.0-or-later
 
+.. _kho:
+
 ========================
 Kexec Handover Subsystem
 ========================
@@ -9,6 +11,7 @@ Kexec Handover Subsystem
 
    concepts
    usage
+   fdbox
 
 .. only::  subproject and html
 
diff --git a/MAINTAINERS b/MAINTAINERS
index d329d3e5514c5..135427582e60f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8866,6 +8866,7 @@ FDBOX
 M:	Pratyush Yadav <pratyush@kernel.org>
 L:	linux-fsdevel@vger.kernel.org
 S:	Maintained
+F:	Documentation/kho/fdbox.rst
 F:	drivers/misc/fdbox.c
 F:	include/linux/fdbox.h
 F:	include/uapi/linux/fdbox.h
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 3/5] mm: shmem: allow callers to specify operations to shmem_undo_range
  2025-03-07  0:57 [RFC PATCH 0/5] Introduce FDBox, and preserve memfd with shmem over KHO Pratyush Yadav
  2025-03-07  0:57 ` [RFC PATCH 1/5] misc: introduce FDBox Pratyush Yadav
  2025-03-07  0:57 ` [RFC PATCH 2/5] misc: add documentation for FDBox Pratyush Yadav
@ 2025-03-07  0:57 ` Pratyush Yadav
  2025-03-07  0:57 ` [RFC PATCH 4/5] mm: shmem: allow preserving file over FDBOX + KHO Pratyush Yadav
  2025-03-07  0:57 ` [RFC PATCH 5/5] mm/memfd: allow preserving FD " Pratyush Yadav
  4 siblings, 0 replies; 29+ messages in thread
From: Pratyush Yadav @ 2025-03-07  0:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Pratyush Yadav, Jonathan Corbet, Eric Biederman, Arnd Bergmann,
	Greg Kroah-Hartman, Alexander Viro, Christian Brauner, Jan Kara,
	Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt,
	David Woodhouse, James Gowans, Mike Rapoport, Paolo Bonzini,
	Pasha Tatashin, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Jason Gunthorpe, Matthew Wilcox, Wei Yang, Andrew Morton,
	linux-fsdevel, linux-doc, linux-mm, kexec

In a following patch, support for preserving a shmem file over kexec
handover (KHO) will be added. When a shmem file is to be preserved over
KHO, its pages must be removed from the inode's page cache and kept
reserved. That work is very similar to what shmem_undo_range() does. The
only extra thing that needs to be done is to track the PFN and index of
each page and get an extra refcount on the page to make sure it does not
get freed.

Refactor shmem_undo_range() to accept the ops it should execute for each
folio, along with a cookie to pass along. During undo, three distinct
kinds of operations are made: truncate a folio, truncate a partial
folio, truncate a folio in swap. Add a callback for each of the
operations. Add shmem_default_undo_ops that maintain the old behaviour,
and make callers use that.

Since the ops for KHO might fail (needing to allocate memory, or being
unable to bring a page back from swap for example), there needs to be a
way for them to report errors and stop the undo. Because of this, the
function returns an int instead of void. This has the unfortunate side
effect of implying this function can fail, though during normal usage,
it should never fail. Add some WARNs to ensure that if that assumption
ever changes, it gets caught.

Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
---

Notes:
    I did it this way since it seemed to be duplicating the least amount of
    code. The undo logic is fairly complicated, and I was not too keen on
    replicating it elsewhere. On thinking about this again, I am not so sure
    if that was a good idea since the end result looks a bit complicated.

 mm/shmem.c | 165 +++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 134 insertions(+), 31 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index 4ea6109a80431..d6d9266b27b75 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1064,12 +1064,56 @@ static struct folio *shmem_get_partial_folio(struct inode *inode, pgoff_t index)
 	return folio;
 }
 
+struct shmem_undo_range_ops {
+	/* Return -ve on error, or number of entries freed. */
+	long (*undo_swap)(struct address_space *mapping, pgoff_t index,
+			  void *old, void *arg);
+	/* Return -ve on error, 0 on success. */
+	int (*undo_folio)(struct address_space *mapping, struct folio *folio,
+			  void *arg);
+	/*
+	 * Return -ve on error, 0 if splitting failed, 1 if splitting succeeded.
+	 */
+	int (*undo_partial_folio)(struct folio *folio, pgoff_t lstart,
+				  pgoff_t lend, void *arg);
+};
+
+static long shmem_default_undo_swap(struct address_space *mapping, pgoff_t index,
+				    void *old, void *arg)
+{
+	return shmem_free_swap(mapping, index, old);
+}
+
+static int shmem_default_undo_folio(struct address_space *mapping,
+				    struct folio *folio, void *arg)
+{
+	truncate_inode_folio(mapping, folio);
+	return 0;
+}
+
+static int shmem_default_undo_partial_folio(struct folio *folio, pgoff_t lstart,
+					    pgoff_t lend, void *arg)
+{
+	/*
+	 * Function returns bool. Convert it to int and return. No error
+	 * returns needed here.
+	 */
+	return truncate_inode_partial_folio(folio, lstart, lend);
+}
+
+static const struct shmem_undo_range_ops shmem_default_undo_ops = {
+	.undo_swap = shmem_default_undo_swap,
+	.undo_folio = shmem_default_undo_folio,
+	.undo_partial_folio = shmem_default_undo_partial_folio,
+};
+
 /*
  * Remove range of pages and swap entries from page cache, and free them.
  * If !unfalloc, truncate or punch hole; if unfalloc, undo failed fallocate.
  */
-static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
-								 bool unfalloc)
+static int shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
+			    bool unfalloc,
+			    const struct shmem_undo_range_ops *ops, void *arg)
 {
 	struct address_space *mapping = inode->i_mapping;
 	struct shmem_inode_info *info = SHMEM_I(inode);
@@ -1081,7 +1125,7 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 	bool same_folio;
 	long nr_swaps_freed = 0;
 	pgoff_t index;
-	int i;
+	int i, ret = 0;
 
 	if (lend == -1)
 		end = -1;	/* unsigned, so actually very big */
@@ -1099,17 +1143,31 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			if (xa_is_value(folio)) {
 				if (unfalloc)
 					continue;
-				nr_swaps_freed += shmem_free_swap(mapping,
-							indices[i], folio);
+
+				ret = ops->undo_swap(mapping, indices[i], folio,
+						     arg);
+				if (ret < 0) {
+					folio_unlock(folio);
+					break;
+				}
+
+				nr_swaps_freed += ret;
 				continue;
 			}
 
-			if (!unfalloc || !folio_test_uptodate(folio))
-				truncate_inode_folio(mapping, folio);
+			if (!unfalloc || !folio_test_uptodate(folio)) {
+				ret = ops->undo_folio(mapping, folio, arg);
+				if (ret < 0) {
+					folio_unlock(folio);
+					break;
+				}
+			}
 			folio_unlock(folio);
 		}
 		folio_batch_remove_exceptionals(&fbatch);
 		folio_batch_release(&fbatch);
+		if (ret < 0)
+			goto out;
 		cond_resched();
 	}
 
@@ -1127,7 +1185,13 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 	if (folio) {
 		same_folio = lend < folio_pos(folio) + folio_size(folio);
 		folio_mark_dirty(folio);
-		if (!truncate_inode_partial_folio(folio, lstart, lend)) {
+		ret = ops->undo_partial_folio(folio, lstart, lend, arg);
+		if (ret < 0) {
+			folio_unlock(folio);
+			folio_put(folio);
+			goto out;
+		}
+		if (ret == 0) {
 			start = folio_next_index(folio);
 			if (same_folio)
 				end = folio->index;
@@ -1141,7 +1205,14 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 		folio = shmem_get_partial_folio(inode, lend >> PAGE_SHIFT);
 	if (folio) {
 		folio_mark_dirty(folio);
-		if (!truncate_inode_partial_folio(folio, lstart, lend))
+		ret = ops->undo_partial_folio(folio, lstart, lend, arg);
+		if (ret < 0) {
+			folio_unlock(folio);
+			folio_put(folio);
+			goto out;
+		}
+
+		if (ret == 0)
 			end = folio->index;
 		folio_unlock(folio);
 		folio_put(folio);
@@ -1166,18 +1237,21 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 			folio = fbatch.folios[i];
 
 			if (xa_is_value(folio)) {
-				long swaps_freed;
-
 				if (unfalloc)
 					continue;
-				swaps_freed = shmem_free_swap(mapping, indices[i], folio);
-				if (!swaps_freed) {
+
+				ret = ops->undo_swap(mapping, indices[i], folio,
+						     arg);
+				if (ret < 0) {
+					break;
+				} else if (ret == 0) {
 					/* Swap was replaced by page: retry */
 					index = indices[i];
 					break;
+				} else {
+					nr_swaps_freed += ret;
+					continue;
 				}
-				nr_swaps_freed += swaps_freed;
-				continue;
 			}
 
 			folio_lock(folio);
@@ -1193,35 +1267,58 @@ static void shmem_undo_range(struct inode *inode, loff_t lstart, loff_t lend,
 						folio);
 
 				if (!folio_test_large(folio)) {
-					truncate_inode_folio(mapping, folio);
-				} else if (truncate_inode_partial_folio(folio, lstart, lend)) {
-					/*
-					 * If we split a page, reset the loop so
-					 * that we pick up the new sub pages.
-					 * Otherwise the THP was entirely
-					 * dropped or the target range was
-					 * zeroed, so just continue the loop as
-					 * is.
-					 */
-					if (!folio_test_large(folio)) {
+					ret = ops->undo_folio(mapping, folio,
+							      arg);
+					if (ret < 0) {
 						folio_unlock(folio);
-						index = start;
 						break;
 					}
+				} else {
+					ret = ops->undo_partial_folio(folio, lstart, lend, arg);
+					if (ret < 0) {
+						folio_unlock(folio);
+						break;
+					}
+
+					if (ret) {
+						/*
+						 * If we split a page, reset the loop so
+						 * that we pick up the new sub pages.
+						 * Otherwise the THP was entirely
+						 * dropped or the target range was
+						 * zeroed, so just continue the loop as
+						 * is.
+						 */
+						if (!folio_test_large(folio)) {
+							folio_unlock(folio);
+							index = start;
+							break;
+						}
+					}
 				}
 			}
 			folio_unlock(folio);
 		}
 		folio_batch_remove_exceptionals(&fbatch);
 		folio_batch_release(&fbatch);
+		if (ret < 0)
+			goto out;
 	}
 
+	ret = 0;
+out:
 	shmem_recalc_inode(inode, 0, -nr_swaps_freed);
+	return ret;
 }
 
 void shmem_truncate_range(struct inode *inode, loff_t lstart, loff_t lend)
 {
-	shmem_undo_range(inode, lstart, lend, false);
+	int ret;
+
+	ret = shmem_undo_range(inode, lstart, lend, false,
+			       &shmem_default_undo_ops, NULL);
+
+	WARN(ret < 0, "shmem_undo_range() should never fail with default ops");
 	inode_set_mtime_to_ts(inode, inode_set_ctime_current(inode));
 	inode_inc_iversion(inode);
 }
@@ -3740,9 +3837,15 @@ static long shmem_fallocate(struct file *file, int mode, loff_t offset,
 			info->fallocend = undo_fallocend;
 			/* Remove the !uptodate folios we added */
 			if (index > start) {
-				shmem_undo_range(inode,
-				    (loff_t)start << PAGE_SHIFT,
-				    ((loff_t)index << PAGE_SHIFT) - 1, true);
+				int ret;
+
+				ret = shmem_undo_range(inode,
+						       (loff_t)start << PAGE_SHIFT,
+						       ((loff_t)index << PAGE_SHIFT) - 1,
+						       true,
+						       &shmem_default_undo_ops,
+						       NULL);
+				WARN(ret < 0, "shmem_undo_range() should never fail with default ops");
 			}
 			goto undone;
 		}
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 4/5] mm: shmem: allow preserving file over FDBOX + KHO
  2025-03-07  0:57 [RFC PATCH 0/5] Introduce FDBox, and preserve memfd with shmem over KHO Pratyush Yadav
                   ` (2 preceding siblings ...)
  2025-03-07  0:57 ` [RFC PATCH 3/5] mm: shmem: allow callers to specify operations to shmem_undo_range Pratyush Yadav
@ 2025-03-07  0:57 ` Pratyush Yadav
  2025-03-07  0:57 ` [RFC PATCH 5/5] mm/memfd: allow preserving FD " Pratyush Yadav
  4 siblings, 0 replies; 29+ messages in thread
From: Pratyush Yadav @ 2025-03-07  0:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Pratyush Yadav, Jonathan Corbet, Eric Biederman, Arnd Bergmann,
	Greg Kroah-Hartman, Alexander Viro, Christian Brauner, Jan Kara,
	Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt,
	David Woodhouse, James Gowans, Mike Rapoport, Paolo Bonzini,
	Pasha Tatashin, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Jason Gunthorpe, Matthew Wilcox, Wei Yang, Andrew Morton,
	linux-fsdevel, linux-doc, linux-mm, kexec

For applications with a large amount of memory that takes time to
rebuild, reboots to consume kernel upgrades can be very expensive. FDBox
allows preserving file descriptors over kexec using KHO. Combining that
with memfd gives those applications reboot-persistent memory that they
can use to quickly save and reconstruct that state.

Since memfd is backed either by hugetlbfs or shmem, use shmem as the
first backend for memfd that is FDBOX + KHO capable.

To preserve the file's contents during KHO activation, the file's page
cache must be walked and all entries removed, and their indices stored.
Use the newly introduced shmem_undo_range_ops to achieve this. Walk each
entry and before truncating it, take a refcount on the folio so it does
not get freed, and store its physical address and index in the kho_mem
and indices arrays.

Swap pages, partial folios, and huge folios are not supported yet.
Encountering those results in an error.

On the restore side, an empty file is created and then the mems array
walked to insert the pages into the page cache. The logic in
shmem_alloc_and_add_folio() is roughly followed.

Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
---
 include/linux/shmem_fs.h |   6 +
 mm/shmem.c               | 333 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 339 insertions(+)

diff --git a/include/linux/shmem_fs.h b/include/linux/shmem_fs.h
index 0b273a7b9f01d..263416f357fe1 100644
--- a/include/linux/shmem_fs.h
+++ b/include/linux/shmem_fs.h
@@ -205,6 +205,12 @@ extern int shmem_mfill_atomic_pte(pmd_t *dst_pmd,
 #endif /* CONFIG_SHMEM */
 #endif /* CONFIG_USERFAULTFD */
 
+#if defined(CONFIG_FDBOX) && defined(CONFIG_KEXEC_HANDOVER)
+bool is_node_shmem(const void *fdt, int offset);
+int shmem_fdbox_kho_write(struct fdbox_fd *ffd, void *fdt);
+struct file *shmem_fdbox_kho_recover(const void *fdt, int offset);
+#endif
+
 /*
  * Used space is stored as unsigned 64-bit value in bytes but
  * quota core supports only signed 64-bit values so use that
diff --git a/mm/shmem.c b/mm/shmem.c
index d6d9266b27b75..c2efdb34a1a18 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -41,6 +41,12 @@
 #include <linux/swapfile.h>
 #include <linux/iversion.h>
 #include <linux/unicode.h>
+#include <linux/libfdt.h>
+#include <linux/fdbox.h>
+#include <linux/vmalloc.h>
+#include <linux/kexec.h>
+#include <linux/kexec_handover.h>
+#include <linux/cleanup.h>
 #include "swap.h"
 
 static struct vfsmount *shm_mnt __ro_after_init;
@@ -5283,6 +5289,333 @@ static int shmem_error_remove_folio(struct address_space *mapping,
 	return 0;
 }
 
+#if defined(CONFIG_FDBOX) && defined(CONFIG_KEXEC_HANDOVER)
+static const char fdbox_kho_compatible[] = "fdbox,shmem-v1";
+
+bool is_node_shmem(const void *fdt, int offset)
+{
+	return fdt_node_check_compatible(fdt, offset, fdbox_kho_compatible) == 0;
+}
+
+struct shmem_fdbox_put_arg {
+	struct kho_mem *mems;
+	unsigned long *indices;
+	unsigned long nr_mems;
+	unsigned long idx;
+};
+
+static long shmem_fdbox_undo_swap(struct address_space *mapping, pgoff_t index,
+				  void *old, void *arg)
+{
+	return -EOPNOTSUPP;
+}
+
+static int shmem_fdbox_undo_folio(struct address_space *mapping,
+				  struct folio *folio, void *__arg)
+{
+	struct shmem_fdbox_put_arg *arg = __arg;
+	struct kho_mem *mem;
+
+	if (arg->idx == arg->nr_mems)
+		return -ENOSPC;
+
+	if (folio_nr_pages(folio) != 1)
+		return -EOPNOTSUPP;
+
+	/*
+	 * Grab an extra refcount to the folio so it sticks around after
+	 * truncation.
+	 */
+	folio_get(folio);
+
+	mem = arg->mems + arg->idx;
+
+	mem->addr = PFN_PHYS(folio_pfn(folio));
+	mem->size = PAGE_SIZE;
+	arg->indices[arg->idx] = folio_index(folio);
+	arg->idx++;
+
+	truncate_inode_folio(mapping, folio);
+	return 0;
+}
+
+static int shmem_fdbox_undo_partial_folio(struct folio *folio, pgoff_t lstart,
+					  pgoff_t lend, void *arg)
+{
+	return -EOPNOTSUPP;
+}
+
+static const struct shmem_undo_range_ops shmem_fdbox_undo_ops = {
+	.undo_swap = shmem_fdbox_undo_swap,
+	.undo_folio = shmem_fdbox_undo_folio,
+	.undo_partial_folio = shmem_fdbox_undo_partial_folio,
+};
+
+static struct kho_mem *shmem_fdbox_kho_get_mems(struct inode *inode,
+						unsigned long **indicesp,
+						unsigned long *nr)
+{
+	struct shmem_inode_info *info = SHMEM_I(inode);
+	unsigned long *indices __free(kvfree) = NULL;
+	struct kho_mem *mems __free(kvfree) = NULL;
+	struct shmem_fdbox_put_arg arg;
+	unsigned long nr_mems;
+	int ret, i;
+
+	scoped_guard(spinlock, &info->lock) {
+		/* TODO: Support swapped pages. Perhaps swap them back in? */
+		if (info->swapped)
+			return ERR_PTR(-EOPNOTSUPP);
+
+		/*
+		 * Estimate the size of the array using the size of the inode,
+		 * assuming there are no contiguous pages.
+		 */
+		nr_mems = info->alloced;
+	}
+
+	mems = kvmalloc_array(nr_mems, sizeof(*mems), GFP_KERNEL);
+	if (!mems)
+		return ERR_PTR(-ENOMEM);
+
+	indices = kvmalloc_array(nr_mems, sizeof(*indices), GFP_KERNEL);
+	if (!indices)
+		return ERR_PTR(-ENOMEM);
+
+	arg.mems = mems;
+	arg.indices = indices;
+	arg.nr_mems = nr_mems;
+	arg.idx = 0;
+
+	ret = shmem_undo_range(inode, 0, -1, false, &shmem_fdbox_undo_ops, &arg);
+	if (ret < 0) {
+		pr_err("shmem: failed to undo fdbox range: %d\n", ret);
+		goto err;
+	}
+
+	*nr = arg.idx;
+	*indicesp = no_free_ptr(indices);
+	return_ptr(mems);
+
+err:
+	/*
+	 * TODO: This kills the whole file on failure to KHO. We should keep the
+	 * contents around for another try later. The problem is, if re-adding
+	 * pages fails, there would be no recovery at that point. Ideally, we
+	 * should first serialize the whole file, and only then remove things
+	 * from page cache so we are sure to never fail.
+	 */
+	for (i = 0; i < arg.idx; i++) {
+		struct folio *folio = page_folio(phys_to_page(mems[i].addr));
+
+		folio_put(folio);
+	}
+
+	/* Undo the rest of the file. This should not fail. */
+	WARN_ON(shmem_undo_range(inode, 0, -1, false, &shmem_default_undo_ops, NULL));
+	return ERR_PTR(ret);
+}
+
+int shmem_fdbox_kho_write(struct fdbox_fd *box_fd, void *fdt)
+{
+	struct inode *inode = box_fd->file->f_inode;
+	unsigned long *indices __free(kvfree) = NULL;
+	struct kho_mem *mems __free(kvfree) = NULL;
+	u64 pos = box_fd->file->f_pos, size = inode->i_size;
+	unsigned long nr_mems, i;
+	int ret = 0;
+
+	/*
+	 * mems can be larger than sizeof(*mems) * nr_mems, but we should only
+	 * look at things in the range of 0 to nr_mems.
+	 */
+	mems = shmem_fdbox_kho_get_mems(inode, &indices, &nr_mems);
+	if (IS_ERR(mems))
+		return PTR_ERR(mems);
+
+	/*
+	 * fdbox should have already started the node. We can start adding
+	 * properties directly.
+	 */
+	ret |= fdt_property(fdt, "compatible", fdbox_kho_compatible,
+			    sizeof(fdbox_kho_compatible));
+	ret |= fdt_property(fdt, "pos", &pos, sizeof(u64));
+	ret |= fdt_property(fdt, "size",  &size, sizeof(u64));
+	ret |= fdt_property(fdt, "mem", mems, sizeof(*mems) * nr_mems);
+	ret |= fdt_property(fdt, "indices", indices, sizeof(*indices) * nr_mems);
+
+	if (ret) {
+		pr_err("shmem: failed to add properties to FDT!\n");
+		ret = -EINVAL;
+		goto err;
+	}
+
+	return 0;
+
+err:
+	/*
+	 * TODO: This kills the whole file on failure to KHO. We should keep the
+	 * contents around for another try later. The problem is, if re-adding
+	 * pages fails, there would be no recovery at that point. Ideally, we
+	 * should first serialize the whole file, and only then remove things
+	 * from page cache so we are sure to never fail.
+	 */
+	for (i = 0; i < nr_mems; i++) {
+		struct folio *folio = page_folio(phys_to_page(mems[i].addr));
+
+		folio_put(folio);
+	}
+	return ret;
+}
+
+struct file *shmem_fdbox_kho_recover(const void *fdt, int offset)
+{
+	struct address_space *mapping;
+	char pathbuf[1024] = "", *path;
+	const unsigned long *indices;
+	const struct kho_mem *mems;
+	unsigned long nr_mems, i = 0;
+	const u64 *pos, *size;
+	struct inode *inode;
+	struct file *file;
+	int len, ret;
+
+	ret = fdt_node_check_compatible(fdt, offset, fdbox_kho_compatible);
+	if (ret) {
+		pr_err("shmem: invalid compatible\n");
+		goto err;
+	}
+
+	mems = fdt_getprop(fdt, offset, "mem", &len);
+	if (!mems || len % sizeof(*mems)) {
+		pr_err("shmem: invalid mems property\n");
+		goto err;
+	}
+	nr_mems = len / sizeof(*mems);
+
+	indices = fdt_getprop(fdt, offset, "indices", &len);
+	if (!indices || len % sizeof(unsigned long)) {
+		pr_err("shmem: invalid indices property\n");
+		goto err_return;
+	}
+	if (len / sizeof(unsigned long) != nr_mems) {
+		pr_err("shmem: number of indices and mems do not match\n");
+		goto err_return;
+	}
+
+	size = fdt_getprop(fdt, offset, "size", &len);
+	if (!size || len != sizeof(u64)) {
+		pr_err("shmem: invalid size property\n");
+		goto err_return;
+	}
+
+	pos = fdt_getprop(fdt, offset, "pos", &len);
+	if (!pos || len != sizeof(u64)) {
+		pr_err("shmem: invalid pos property\n");
+		goto err_return;
+	}
+
+	/*
+	 * TODO: This sets UID/GID, cgroup accounting to root. Should this
+	 * be given to the first user that maps the FD instead?
+	 */
+	file = shmem_file_setup(fdt_get_name(fdt, offset, NULL), 0,
+				VM_NORESERVE);
+	if (IS_ERR(file)) {
+		pr_err("shmem: failed to setup file\n");
+		goto err_return;
+	}
+
+	inode = file->f_inode;
+	mapping = inode->i_mapping;
+	vfs_setpos(file, *pos, MAX_LFS_FILESIZE);
+
+	for (; i < nr_mems; i++) {
+		struct folio *folio;
+		void *va;
+
+		if (mems[i].size != PAGE_SIZE) {
+			pr_err("shmem: unknown kho_mem size %llx. Expected %lx\n",
+			       mems[i].size, PAGE_SIZE);
+			goto err_return;
+		}
+
+		va = kho_claim_mem(&mems[i]);
+		folio = virt_to_folio(va);
+
+		/* Set up the folio for insertion. */
+
+		/*
+		 * TODO: This breaks falloc-ed folios since now they get marked
+		 * uptodate when they might not actually be zeroed out yet. Need
+		 * a way to distinguish falloc-ed folios.
+		 */
+		folio_mark_uptodate(folio);
+		folio_mark_dirty(folio);
+
+		/*
+		 * TODO: Should find a way to unify this and
+		 * shmem_alloc_and_add_folio().
+		 */
+		__folio_set_locked(folio);
+		__folio_set_swapbacked(folio);
+
+		ret = mem_cgroup_charge(folio, NULL, mapping_gfp_mask(mapping));
+		if (ret) {
+			folio_unlock(folio);
+			folio_put(folio);
+			fput(file);
+			pr_err("shmem: failed to charge folio index %lu\n", i);
+			goto err_return_next;
+		}
+
+		ret = shmem_add_to_page_cache(folio, mapping, indices[i], NULL,
+					      mapping_gfp_mask(mapping));
+		if (ret) {
+			folio_unlock(folio);
+			folio_put(folio);
+			fput(file);
+			pr_err("shmem: failed to add to page cache folio index %lu\n", i);
+			goto err_return_next;
+		}
+
+		ret = shmem_inode_acct_blocks(inode, 1);
+		if (ret) {
+			folio_unlock(folio);
+			folio_put(folio);
+			fput(file);
+			pr_err("shmem: failed to account folio index %lu\n", i);
+			goto err_return_next;
+		}
+
+		shmem_recalc_inode(inode, 1, 0);
+		folio_add_lru(folio);
+		folio_unlock(folio);
+		folio_put(folio);
+	}
+
+	inode->i_size = *size;
+
+	return file;
+
+err_return:
+	kho_return_mem(mems + i);
+err_return_next:
+	for (i = i + 1; i < nr_mems; i++)
+		kho_return_mem(mems + i);
+err:
+	ret = fdt_get_path(fdt, offset, pathbuf, sizeof(pathbuf));
+	if (ret)
+		path = "unknown";
+	else
+		path = pathbuf;
+
+	pr_err("shmem: error when recovering KHO node '%s'\n", path);
+	return NULL;
+}
+
+#endif /* CONFIG_FDBOX && CONFIG_KEXEC_HANDOVER */
+
 static const struct address_space_operations shmem_aops = {
 	.writepage	= shmem_writepage,
 	.dirty_folio	= noop_dirty_folio,
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [RFC PATCH 5/5] mm/memfd: allow preserving FD over FDBOX + KHO
  2025-03-07  0:57 [RFC PATCH 0/5] Introduce FDBox, and preserve memfd with shmem over KHO Pratyush Yadav
                   ` (3 preceding siblings ...)
  2025-03-07  0:57 ` [RFC PATCH 4/5] mm: shmem: allow preserving file over FDBOX + KHO Pratyush Yadav
@ 2025-03-07  0:57 ` Pratyush Yadav
  4 siblings, 0 replies; 29+ messages in thread
From: Pratyush Yadav @ 2025-03-07  0:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: Pratyush Yadav, Jonathan Corbet, Eric Biederman, Arnd Bergmann,
	Greg Kroah-Hartman, Alexander Viro, Christian Brauner, Jan Kara,
	Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt,
	David Woodhouse, James Gowans, Mike Rapoport, Paolo Bonzini,
	Pasha Tatashin, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Jason Gunthorpe, Matthew Wilcox, Wei Yang, Andrew Morton,
	linux-fsdevel, linux-doc, linux-mm, kexec

For applications with a large amount of memory that takes time to
rebuild, reboots to consume kernel upgrades can be very expensive. FDBox
allows preserving file descriptors over kexec using KHO. Combining that
with memfd gives those applications reboot-persistent memory that they
can use to quickly save and reconstruct that state.

While memfd is backed by either hugetlbfs or shmem, currently only
support on shmem is added for this. Allow saving and restoring shmem FDs
over FDBOX + KHO.

The memfd FDT node itself does not contain much information. It just
creates a subnode and passes it over to shmem to do its thing. Similar
behaviour is followed on the restore side.

Since there are now two paths of getting a shmem file, refactor the file
setup into its own function called memfd_setup_file(). It sets up the
file flags, mode, etc., and sets fdbox ops if enabled.

Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
---
 mm/memfd.c | 128 ++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 116 insertions(+), 12 deletions(-)

diff --git a/mm/memfd.c b/mm/memfd.c
index 37f7be57c2f50..1c32e66197f6d 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -7,6 +7,8 @@
  * This file is released under the GPL.
  */
 
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include <linux/fs.h>
 #include <linux/vfs.h>
 #include <linux/pagemap.h>
@@ -19,8 +21,12 @@
 #include <linux/shmem_fs.h>
 #include <linux/memfd.h>
 #include <linux/pid_namespace.h>
+#include <linux/fdbox.h>
+#include <linux/libfdt.h>
 #include <uapi/linux/memfd.h>
 
+static const struct fdbox_file_ops memfd_fdbox_fops;
+
 /*
  * We need a tag: a new tag would expand every xa_node by 8 bytes,
  * so reuse a tag which we firmly believe is never set or cleared on tmpfs
@@ -418,21 +424,10 @@ static char *alloc_name(const char __user *uname)
 	return ERR_PTR(error);
 }
 
-static struct file *alloc_file(const char *name, unsigned int flags)
+static void memfd_setup_file(struct file *file, unsigned int flags)
 {
 	unsigned int *file_seals;
-	struct file *file;
 
-	if (flags & MFD_HUGETLB) {
-		file = hugetlb_file_setup(name, 0, VM_NORESERVE,
-					HUGETLB_ANONHUGE_INODE,
-					(flags >> MFD_HUGE_SHIFT) &
-					MFD_HUGE_MASK);
-	} else {
-		file = shmem_file_setup(name, 0, VM_NORESERVE);
-	}
-	if (IS_ERR(file))
-		return file;
 	file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
 	file->f_flags |= O_LARGEFILE;
 
@@ -452,6 +447,27 @@ static struct file *alloc_file(const char *name, unsigned int flags)
 			*file_seals &= ~F_SEAL_SEAL;
 	}
 
+#if defined(CONFIG_FDBOX) && defined(CONFIG_KEXEC_HANDOVER)
+	file->f_fdbox_op = &memfd_fdbox_fops;
+#endif
+}
+
+static struct file *alloc_file(const char *name, unsigned int flags)
+{
+	struct file *file;
+
+	if (flags & MFD_HUGETLB) {
+		file = hugetlb_file_setup(name, 0, VM_NORESERVE,
+					  HUGETLB_ANONHUGE_INODE,
+					  (flags >> MFD_HUGE_SHIFT) &
+					  MFD_HUGE_MASK);
+	} else {
+		file = shmem_file_setup(name, 0, VM_NORESERVE);
+	}
+	if (IS_ERR(file))
+		return file;
+
+	memfd_setup_file(file, flags);
 	return file;
 }
 
@@ -493,3 +509,91 @@ SYSCALL_DEFINE2(memfd_create,
 	kfree(name);
 	return error;
 }
+
+#if defined(CONFIG_FDBOX) && defined(CONFIG_KEXEC_HANDOVER)
+static const char memfd_fdbox_compatible[] = "fdbox,memfd-v1";
+
+static struct file *memfd_fdbox_kho_recover(const void *fdt, int offset)
+{
+	struct file *file;
+	int ret, subnode;
+
+	ret = fdt_node_check_compatible(fdt, offset, memfd_fdbox_compatible);
+	if (ret) {
+		pr_err("kho: invalid compatible\n");
+		return NULL;
+	}
+
+	/* Make sure there is exactly one subnode. */
+	subnode = fdt_first_subnode(fdt, offset);
+	if (subnode < 0) {
+		pr_err("kho: no subnode for underlying storage found!\n");
+		return NULL;
+	}
+	if (fdt_next_subnode(fdt, subnode) >= 0) {
+		pr_err("kho: too many subnodes. Expected only 1.\n");
+		return NULL;
+	}
+
+	if (is_node_shmem(fdt, subnode)) {
+		file = shmem_fdbox_kho_recover(fdt, subnode);
+		if (!file)
+			return NULL;
+
+		memfd_setup_file(file, 0);
+		return file;
+	}
+
+	return NULL;
+}
+
+static int memfd_fdbox_kho_write(struct fdbox_fd *box_fd, void *fdt)
+{
+	int ret = 0;
+
+	ret |= fdt_property(fdt, "compatible", memfd_fdbox_compatible,
+			    sizeof(memfd_fdbox_compatible));
+
+	/* TODO: Track seals on the file as well. */
+
+	ret |= fdt_begin_node(fdt, "");
+	if (ret) {
+		pr_err("kho: failed to set up memfd node\n");
+		return -EINVAL;
+	}
+
+	if (shmem_file(box_fd->file))
+		ret = shmem_fdbox_kho_write(box_fd, fdt);
+	else
+		/* TODO: HugeTLB support. */
+		ret = -EOPNOTSUPP;
+
+	if (ret)
+		return ret;
+
+	ret = fdt_end_node(fdt);
+	if (ret) {
+		pr_err("kho: failed to end memfd node!\n");
+		return ret;
+	}
+
+	return 0;
+}
+
+static const struct fdbox_file_ops memfd_fdbox_fops = {
+	.kho_write = memfd_fdbox_kho_write,
+};
+
+static int __init memfd_fdbox_init(void)
+{
+	int error;
+
+	error = fdbox_register_handler(memfd_fdbox_compatible,
+				       memfd_fdbox_kho_recover);
+	if (error)
+		pr_err("Could not register fdbox handler: %d\n", error);
+
+	return 0;
+}
+late_initcall(memfd_fdbox_init);
+#endif /* CONFIG_FDBOX && CONFIG_KEXEC_HANDOVER */
-- 
2.47.1


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 2/5] misc: add documentation for FDBox
  2025-03-07  0:57 ` [RFC PATCH 2/5] misc: add documentation for FDBox Pratyush Yadav
@ 2025-03-07  2:19   ` Randy Dunlap
  2025-03-07 15:03     ` Pratyush Yadav
  2025-03-07 14:22   ` Jonathan Corbet
  1 sibling, 1 reply; 29+ messages in thread
From: Randy Dunlap @ 2025-03-07  2:19 UTC (permalink / raw)
  To: Pratyush Yadav, linux-kernel
  Cc: Jonathan Corbet, Eric Biederman, Arnd Bergmann,
	Greg Kroah-Hartman, Alexander Viro, Christian Brauner, Jan Kara,
	Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt,
	David Woodhouse, James Gowans, Mike Rapoport, Paolo Bonzini,
	Pasha Tatashin, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Jason Gunthorpe, Matthew Wilcox, Wei Yang, Andrew Morton,
	linux-fsdevel, linux-doc, linux-mm, kexec

On March 6, 2025 4:57:36 PM PST, Pratyush Yadav <ptyadav@amazon.de> wrote:
>With FDBox in place, add documentation that describes what it is and how
>it is used, along with its UAPI and in-kernel API.
>
>Since the document refers to KHO, add a reference tag in kho/index.rst.
>
>Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
>---
> Documentation/filesystems/locking.rst |  21 +++
> Documentation/kho/fdbox.rst           | 224 ++++++++++++++++++++++++++
> Documentation/kho/index.rst           |   3 +
> MAINTAINERS                           |   1 +
> 4 files changed, 249 insertions(+)
> create mode 100644 Documentation/kho/fdbox.rst
>
>diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
>index d20a32b77b60f..5526833faf79a 100644
>--- a/Documentation/filesystems/locking.rst
>+++ b/Documentation/filesystems/locking.rst
>@@ -607,6 +607,27 @@ used. To block changes to file contents via a memory mapping during the
> operation, the filesystem must take mapping->invalidate_lock to coordinate
> with ->page_mkwrite.
> 
>+fdbox_file_ops
>+==============
>+
>+prototypes::
>+
>+	int (*kho_write)(struct fdbox_fd *box_fd, void *fdt);
>+	int (*seal)(struct fdbox *box);
>+	int (*unseal)(struct fdbox *box);
>+
>+
>+locking rules:
>+	all may block
>+
>+==============	==================================================
>+ops		i_rwsem(box_fd->file->f_inode)
>+==============	==================================================
>+kho_write:	exclusive
>+seal:		no
>+unseal:		no
>+==============	==================================================
>+
> dquot_operations
> ================
> 
>diff --git a/Documentation/kho/fdbox.rst b/Documentation/kho/fdbox.rst
>new file mode 100644
>index 0000000000000..44a3f5cdf1efb
>--- /dev/null
>+++ b/Documentation/kho/fdbox.rst
>@@ -0,0 +1,224 @@
>+.. SPDX-License-Identifier: GPL-2.0-or-later
>+
>+===========================
>+File Descriptor Box (FDBox)
>+===========================
>+
>+:Author: Pratyush Yadav
>+
>+Introduction
>+============
>+
>+The File Descriptor Box (FDBox) is a mechanism for userspace to name file
>+descriptors and give them over to the kernel to hold. They can later be
>+retrieved by passing in the same name.
>+
>+The primary purpose of FDBox is to be used with :ref:`kho`. There are many kinds

    many kinds of 

>+anonymous file descriptors in the kernel like memfd, guest_memfd, iommufd, etc.

   etc.,

>+that would be useful to be preserved using KHO. To be able to do that, there
>+needs to be a mechanism to label FDs that allows userspace to set the label
>+before doing KHO and to use the label to map them back after KHO. FDBox achieves
>+that purpose by exposing a miscdevice which exposes ioctls to label and transfer
>+FDs between the kernel and userspace. FDBox is not intended to work with any
>+generic file descriptor. Support for each kind of FDs must be explicitly
>+enabled.
>+
>+FDBox can be enabled by setting the ``CONFIG_FDBOX`` option to ``y``. While the
>+primary purpose of FDBox is to be used with KHO, it does not explicitly require
>+``CONFIG_KEXEC_HANDOVER``, since it can be used without KHO, simply as a way to
>+preserve or transfer FDs when userspace exits.
>+
>+Concepts
>+========
>+
>+Box
>+---
>+
>+The box is a container for FDs. Boxes are identified by their name, which must
>+be unique. Userspace can put FDs in the box using the ``FDBOX_PUT_FD``
>+operation, and take them out of the box using the ``FDBOX_GET_FD`` operation.

Is this ioctl range documented is ioctl-number.rst?
I didn't notice a patch for that.

>+Once all the required FDs are put into the box, it can be sealed to make it
>+ready for shipping. This can be done by the ``FDBOX_SEAL`` operation. The seal
>+operation notifies each FD in the box. If any of the FDs have a dependency on
>+another, this gives them an opportunity to ensure all dependencies are met, or
>+fail the seal if not. Once a box is sealed, no FDs can be added or removed from
>+the box until it is unsealed. Only sealed boxes are transported to a new kernel

What if KHO is not being used?

>+via KHO. The box can be unsealed by the ``FDBOX_UNSEAL`` operation. This is the
>+opposite of seal. It also notifies each FD in the box to ensure all dependencies
>+are met. This can be useful in case some FDs fail to be restored after KHO.
>+
>+Box FD
>+------

I can't tell in my email font, but is each underlinoat least as long as the title above it?

>+
>+The Box FD is a FD that is currently in a box. It is identified by its name,
>+which must be unique in the box it belongs to. The Box FD is created when a FD
>+is put into a box by using the ``FDBOX_PUT_FD`` operation. This operation
>+removes the FD from the calling task. The FD can be restored by passing the
>+unique name to the ``FDBOX_GET_FD`` operation.
>+
>+FDBox control device
>+--------------------
>+
>+This is the ``/dev/fdbox/fdbox`` device. A box can be created using the
>+``FDBOX_CREATE_BOX`` operation on the device. A box can be removed using the
>+``FDBOX_DELETE_BOX`` operation.
>+
>+UAPI
>+====
>+
>+FDBOX_NAME_LEN
>+--------------
>+
>+.. code-block:: c
>+
>+    #define FDBOX_NAME_LEN			256
>+
>+Maximum length of the name of a Box or Box FD.
>+
>+Ioctls on /dev/fdbox/fdbox
>+--------------------------
>+
>+FDBOX_CREATE_BOX
>+~~~~~~~~~~~~~~~~
>+
>+.. code-block:: c
>+
>+    #define FDBOX_CREATE_BOX	_IO(FDBOX_TYPE, FDBOX_BASE + 0)
>+    struct fdbox_create_box {
>+    	__u64 flags;
>+    	__u8 name[FDBOX_NAME_LEN];
>+    };
>+
>+Create a box.
>+
>+After this returns, the box is available at ``/dev/fdbox/<name>``.
>+
>+``name``
>+    The name of the box to be created. Must be unique.
>+
>+``flags``
>+    Flags to the operation. Currently, no flags are defined.
>+
>+Returns:
>+    0 on success, -1 on error, with errno set.
>+
>+FDBOX_DELETE_BOX
>+~~~~~~~~~~~~~~~~
>+
>+.. code-block:: c
>+
>+    #define FDBOX_DELETE_BOX	_IO(FDBOX_TYPE, FDBOX_BASE + 1)
>+    struct fdbox_delete_box {
>+    	__u64 flags;
>+    	__u8 name[FDBOX_NAME_LEN];
>+    };
>+
>+Delete a box.
>+
>+After this returns, the box is no longer available at ``/dev/fdbox/<name>``.
>+
>+``name``
>+    The name of the box to be deleted.
>+
>+``flags``
>+    Flags to the operation. Currently, no flags are defined.
>+
>+Returns:
>+    0 on success, -1 on error, with errno set.
>+
>+Ioctls on /dev/fdbox/<boxname>
>+------------------------------
>+
>+These must be performed on the ``/dev/fdbox/<boxname>`` device.
>+
>+FDBX_PUT_FD
>+~~~~~~~~~~~
>+
>+.. code-block:: c
>+
>+    #define FDBOX_PUT_FD	_IO(FDBOX_TYPE, FDBOX_BASE + 2)
>+    struct fdbox_put_fd {
>+    	__u64 flags;
>+    	__u32 fd;
>+    	__u32 pad;
>+    	__u8 name[FDBOX_NAME_LEN];
>+    };
>+
>+
>+Put FD into the box.
>+
>+After this returns, ``fd`` is removed from the task and can no longer be used by
>+it.
>+
>+``name``
>+    The name of the FD.
>+
>+``fd``
>+    The file descriptor number to be
>+
>+``flags``
>+    Flags to the operation. Currently, no flags are defined.
>+
>+Returns:
>+    0 on success, -1 on error, with errno set.
>+
>+FDBX_GET_FD
>+~~~~~~~~~~~
>+
>+.. code-block:: c
>+
>+    #define FDBOX_GET_FD	_IO(FDBOX_TYPE, FDBOX_BASE + 3)
>+    struct fdbox_get_fd {
>+    	__u64 flags;
>+    	__u8 name[FDBOX_NAME_LEN];
>+    };
>+
>+Get an FD from the box.
>+
>+After this returns, the FD identified by ``name`` is mapped into the task and is
>+available for use.
>+
>+``name``
>+    The name of the FD to get.
>+
>+``flags``
>+    Flags to the operation. Currently, no flags are defined.
>+
>+Returns:
>+    FD number on success, -1 on error with errno set.
>+
>+FDBOX_SEAL
>+~~~~~~~~~~
>+
>+.. code-block:: c
>+
>+    #define FDBOX_SEAL	_IO(FDBOX_TYPE, FDBOX_BASE + 4)
>+
>+Seal the box.
>+
>+Gives the kernel an opportunity to ensure all dependencies are met in the box.
>+After this returns, the box is sealed and FDs can no longer be added or removed
>+from it. A box must be sealed for it to be transported across KHO.
>+
>+Returns:
>+    0 on success, -1 on error with errno set.
>+
>+FDBOX_UNSEAL
>+~~~~~~~~~~~~
>+
>+.. code-block:: c
>+
>+    #define FDBOX_UNSEAL	_IO(FDBOX_TYPE, FDBOX_BASE + 5)
>+
>+Unseal the box.
>+
>+Gives the kernel an opportunity to ensure all dependencies are met in the box,
>+and in case of KHO, no FDs have been lost in transit.
>+
>+Returns:
>+    0 on success, -1 on error with errno set.
>+
>+Kernel functions and structures
>+===============================
>+
>+.. kernel-doc:: include/linux/fdbox.h
>diff --git a/Documentation/kho/index.rst b/Documentation/kho/index.rst
>index 5e7eeeca8520f..051513b956075 100644
>--- a/Documentation/kho/index.rst
>+++ b/Documentation/kho/index.rst
>@@ -1,5 +1,7 @@
> .. SPDX-License-Identifier: GPL-2.0-or-later
> 
>+.. _kho:
>+
> ========================
> Kexec Handover Subsystem
> ========================
>@@ -9,6 +11,7 @@ Kexec Handover Subsystem
> 
>    concepts
>    usage
>+   fdbox
> 
> .. only::  subproject and html
> 
>diff --git a/MAINTAINERS b/MAINTAINERS
>index d329d3e5514c5..135427582e60f 100644
>--- a/MAINTAINERS
>+++ b/MAINTAINERS
>@@ -8866,6 +8866,7 @@ FDBOX
> M:	Pratyush Yadav <pratyush@kernel.org>
> L:	linux-fsdevel@vger.kernel.org
> S:	Maintained
>+F:	Documentation/kho/fdbox.rst
> F:	drivers/misc/fdbox.c
> F:	include/linux/fdbox.h
> F:	include/uapi/linux/fdbox.h


~Randy

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 1/5] misc: introduce FDBox
  2025-03-07  0:57 ` [RFC PATCH 1/5] misc: introduce FDBox Pratyush Yadav
@ 2025-03-07  6:03   ` Greg Kroah-Hartman
  2025-03-07  9:31   ` Christian Brauner
  1 sibling, 0 replies; 29+ messages in thread
From: Greg Kroah-Hartman @ 2025-03-07  6:03 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: linux-kernel, Jonathan Corbet, Eric Biederman, Arnd Bergmann,
	Alexander Viro, Christian Brauner, Jan Kara, Hugh Dickins,
	Alexander Graf, Benjamin Herrenschmidt, David Woodhouse,
	James Gowans, Mike Rapoport, Paolo Bonzini, Pasha Tatashin,
	Anthony Yznaga, Dave Hansen, David Hildenbrand, Jason Gunthorpe,
	Matthew Wilcox, Wei Yang, Andrew Morton, linux-fsdevel, linux-doc,
	linux-mm, kexec

One quick review note:

On Fri, Mar 07, 2025 at 12:57:35AM +0000, Pratyush Yadav wrote:
> +/**
> + * struct fdbox - A box of FDs.
> + * @name: Name of the box. Must be unique.
> + * @rwsem: Used to ensure exclusive access to the box during SEAL/UNSEAL
> + *         operations.
> + * @dev: Backing device for the character device.
> + * @cdev: Character device which accepts ioctls from userspace.

You now have a structure that contains 2 different reference counts,
which is going to be impossible to handle properly.  Which one defines
the lifetime of the object?  That's not going to work, please fix.

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 1/5] misc: introduce FDBox
  2025-03-07  0:57 ` [RFC PATCH 1/5] misc: introduce FDBox Pratyush Yadav
  2025-03-07  6:03   ` Greg Kroah-Hartman
@ 2025-03-07  9:31   ` Christian Brauner
  2025-03-07 13:19     ` Christian Brauner
                       ` (2 more replies)
  1 sibling, 3 replies; 29+ messages in thread
From: Christian Brauner @ 2025-03-07  9:31 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Linus Torvalds, linux-kernel, Jonathan Corbet, Eric Biederman,
	Arnd Bergmann, Greg Kroah-Hartman, Alexander Viro, Jan Kara,
	Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt,
	David Woodhouse, James Gowans, Mike Rapoport, Paolo Bonzini,
	Pasha Tatashin, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Jason Gunthorpe, Matthew Wilcox, Wei Yang, Andrew Morton,
	linux-fsdevel, linux-doc, linux-mm, kexec

On Fri, Mar 07, 2025 at 12:57:35AM +0000, Pratyush Yadav wrote:
> The File Descriptor Box (FDBox) is a mechanism for userspace to name
> file descriptors and give them over to the kernel to hold. They can
> later be retrieved by passing in the same name.
> 
> The primary purpose of FDBox is to be used with Kexec Handover (KHO).
> There are many kinds anonymous file descriptors in the kernel like
> memfd, guest_memfd, iommufd, etc. that would be useful to be preserved
> using KHO. To be able to do that, there needs to be a mechanism to label
> FDs that allows userspace to set the label before doing KHO and to use
> the label to map them back after KHO. FDBox achieves that purpose by
> exposing a miscdevice which exposes ioctls to label and transfer FDs
> between the kernel and userspace. FDBox is not intended to work with any
> generic file descriptor. Support for each kind of FDs must be explicitly
> enabled.

This makes no sense as a generic concept. If you want to restore shmem
and possibly anonymous inodes files via KHO then tailor the solution to
shmem and anon inodes but don't make this generic infrastructure. This
has zero chances to cover generic files.

As soon as you're dealing with non-kernel internal mounts that are not
guaranteed to always be there or something that depends on superblock or
mount specific information that can change you're already screwed. This
will end up a giant mess. This is not supportable or maintainable.

And struct file should have zero to do with this KHO stuff. It doesn't
need to carry new operations and it doesn't need to waste precious space
for any of this.

> 
> While the primary purpose of FDBox is to be used with KHO, it does not
> explicitly require CONFIG_KEXEC_HANDOVER, since it can be used without
> KHO, simply as a way to preserve or transfer FDs when userspace exits.

This use-case is covered with systemd's fdstore and it's available to
unprivileged userspace. Stashing arbitrary file descriptors in the
kernel in this way isn't a good idea.

> 
> Co-developed-by: Alexander Graf <graf@amazon.com>
> Signed-off-by: Alexander Graf <graf@amazon.com>
> Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
> ---
> 
> Notes:
>     In a real live-update environment, it would likely make more sense to
>     have a way of passing a hint to the kernel that KHO is about to happen
>     and it should start preparing. Having as much state serialized as
>     possible before the KHO freeze would help reduce downtime. An FDBox
>     operation, say FDBOX_PREPARE_FD that can give the signal to prepare
>     before actually being put in the box and sealed. I have not added
>     something like that yet for simplicity sake.
>
>  MAINTAINERS                |   8 +
>  drivers/misc/Kconfig       |   7 +
>  drivers/misc/Makefile      |   1 +
>  drivers/misc/fdbox.c       | 758 +++++++++++++++++++++++++++++++++++++
>  include/linux/fdbox.h      | 119 ++++++
>  include/linux/fs.h         |   7 +
>  include/linux/miscdevice.h |   1 +
>  include/uapi/linux/fdbox.h |  61 +++
>  8 files changed, 962 insertions(+)
>  create mode 100644 drivers/misc/fdbox.c
>  create mode 100644 include/linux/fdbox.h
>  create mode 100644 include/uapi/linux/fdbox.h
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 82c2ef421c000..d329d3e5514c5 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -8862,6 +8862,14 @@ F:	include/scsi/libfc.h
>  F:	include/scsi/libfcoe.h
>  F:	include/uapi/scsi/fc/
>  
> +FDBOX
> +M:	Pratyush Yadav <pratyush@kernel.org>
> +L:	linux-fsdevel@vger.kernel.org
> +S:	Maintained
> +F:	drivers/misc/fdbox.c
> +F:	include/linux/fdbox.h
> +F:	include/uapi/linux/fdbox.h
> +
>  FILE LOCKING (flock() and fcntl()/lockf())
>  M:	Jeff Layton <jlayton@kernel.org>
>  M:	Chuck Lever <chuck.lever@oracle.com>
> diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
> index 56bc72c7ce4a9..6fee70c9479c4 100644
> --- a/drivers/misc/Kconfig
> +++ b/drivers/misc/Kconfig
> @@ -632,6 +632,13 @@ config MCHP_LAN966X_PCI
>  	    - lan966x-miim (MDIO_MSCC_MIIM)
>  	    - lan966x-switch (LAN966X_SWITCH)
>  
> +config FDBOX
> +	bool "File Descriptor Box device to persist fds"
> +	help
> +	  Add a new /dev/fdbox directory that allows user space to preserve specific
> +	  types of file descritors when user space exits. Also preserves the same
> +	  types of file descriptors across kexec when KHO is enabled.
> +
>  source "drivers/misc/c2port/Kconfig"
>  source "drivers/misc/eeprom/Kconfig"
>  source "drivers/misc/cb710/Kconfig"
> diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
> index 545aad06d0885..59a398dcfcd64 100644
> --- a/drivers/misc/Makefile
> +++ b/drivers/misc/Makefile
> @@ -75,3 +75,4 @@ lan966x-pci-objs		:= lan966x_pci.o
>  lan966x-pci-objs		+= lan966x_pci.dtbo.o
>  obj-$(CONFIG_MCHP_LAN966X_PCI)	+= lan966x-pci.o
>  obj-y				+= keba/
> +obj-$(CONFIG_FDBOX)		+= fdbox.o
> diff --git a/drivers/misc/fdbox.c b/drivers/misc/fdbox.c
> new file mode 100644
> index 0000000000000..a8f6574e2c25f
> --- /dev/null
> +++ b/drivers/misc/fdbox.c
> @@ -0,0 +1,758 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * fdbox.c - framework to preserve file descriptors across
> + *           process lifetime and kexec
> + *
> + * Copyright (C) 2024-2025 Amazon.com Inc. or its affiliates.
> + *
> + * Author: Pratyush Yadav <ptyadav@amazon.de>
> + * Author: Alexander Graf <graf@amazon.com>
> + */
> +
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +#include <linux/device.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/cdev.h>
> +#include <linux/miscdevice.h>
> +#include <linux/fdtable.h>
> +#include <linux/file.h>
> +#include <linux/kexec.h>
> +#include <linux/kexec_handover.h>
> +#include <linux/libfdt.h>
> +#include <linux/fdbox.h>
> +
> +static struct miscdevice fdbox_dev;
> +
> +static struct {
> +	struct class			*class;
> +	dev_t				box_devt;
> +	struct xarray			box_list;
> +	struct xarray			handlers;
> +	struct rw_semaphore		recover_sem;
> +	bool				recover_done;
> +} priv = {
> +	.box_list = XARRAY_INIT(fdbox.box_list, XA_FLAGS_ALLOC),
> +	.handlers = XARRAY_INIT(fdbox.handlers, XA_FLAGS_ALLOC),
> +	.recover_sem = __RWSEM_INITIALIZER(priv.recover_sem),
> +};
> +
> +struct fdbox_handler {
> +	const char *compatible;
> +	struct file *(*fn)(const void *fdt, int offset);
> +};
> +
> +static struct fdbox *fdbox_remove_box(char *name)
> +{
> +	struct xarray *boxlist = &priv.box_list;
> +	unsigned long box_idx;
> +	struct fdbox *box;
> +
> +	xa_lock(boxlist);
> +	xa_for_each(boxlist, box_idx, box) {
> +		if (!strcmp(box->name, name)) {
> +			__xa_erase(boxlist, box_idx);
> +			break;
> +		}
> +	}
> +	xa_unlock(boxlist);
> +
> +	return box;
> +}
> +
> +static struct fdbox_fd *fdbox_remove_fd(struct fdbox *box, char *name)
> +{
> +	struct xarray *fdlist = &box->fd_list;
> +	struct fdbox_fd *box_fd;
> +	unsigned long idx;
> +
> +	xa_lock(fdlist);
> +	xa_for_each(fdlist, idx, box_fd) {
> +		if (!strncmp(box_fd->name, name, sizeof(box_fd->name))) {
> +			__xa_erase(fdlist, idx);
> +			break;
> +		}
> +	}
> +	xa_unlock(fdlist);
> +
> +	return box_fd;
> +}
> +
> +/* Must be called with box->rwsem held. */
> +static struct fdbox_fd *fdbox_put_file(struct fdbox *box, const char *name,
> +				       struct file *file)
> +{
> +	struct fdbox_fd *box_fd __free(kfree) = NULL, *cmp;
> +	struct xarray *fdlist = &box->fd_list;
> +	unsigned long idx;
> +	u32 newid;
> +	int ret;
> +
> +	/* Only files that set f_fdbox_op are allowed in the box. */
> +	if (!file->f_fdbox_op)
> +		return ERR_PTR(-EOPNOTSUPP);
> +
> +	box_fd = kzalloc(sizeof(*box_fd), GFP_KERNEL);
> +	if (!box_fd)
> +		return ERR_PTR(-ENOMEM);
> +
> +	if (strscpy_pad(box_fd->name, name, sizeof(box_fd->name)) < 0)
> +		/* Name got truncated. This means the name is not NUL-terminated. */
> +		return ERR_PTR(-EINVAL);
> +
> +	box_fd->file = file;
> +	box_fd->box = box;
> +
> +	xa_lock(fdlist);
> +	xa_for_each(fdlist, idx, cmp) {
> +		/* Look for name collisions. */
> +		if (!strcmp(box_fd->name, cmp->name)) {
> +			xa_unlock(fdlist);
> +			return ERR_PTR(-EEXIST);
> +		}
> +	}
> +
> +	ret = __xa_alloc(fdlist, &newid, box_fd, xa_limit_32b, GFP_KERNEL);
> +	xa_unlock(fdlist);
> +	if (ret)
> +		return ERR_PTR(ret);
> +
> +	return_ptr(box_fd);
> +}
> +
> +static long fdbox_put_fd(struct fdbox *box, unsigned long arg)
> +{
> +	struct fdbox_put_fd put_fd;
> +	struct fdbox_fd *box_fd;
> +	struct file *file;
> +	int ret;
> +
> +	if (copy_from_user(&put_fd, (void __user *)arg, sizeof(put_fd)))
> +		return -EFAULT;
> +
> +	guard(rwsem_read)(&box->rwsem);
> +
> +	if (box->sealed)
> +		return -EBUSY;
> +
> +	file = fget_raw(put_fd.fd);
> +	if (!file)
> +		return -EINVAL;
> +
> +	box_fd = fdbox_put_file(box, put_fd.name, file);
> +	if (IS_ERR(box_fd)) {
> +		fput(file);
> +		return PTR_ERR(box_fd);
> +	}
> +
> +	ret = close_fd(put_fd.fd);
> +	if (ret) {
> +		struct fdbox_fd *del;
> +
> +		del = fdbox_remove_fd(box, put_fd.name);
> +		/*
> +		 * If we fail to remove from list, it means someone else took
> +		 * the FD out. In that case, they own the refcount of the file
> +		 * now.
> +		 */
> +		if (del == box_fd)
> +			fput(file);

This is a racy mess. Why would adding a file to an fdbox be coupled with
closing it concpetually? The caller should close the file descriptor
itself and not do this close_fd() here in the kernel.

> +
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static long fdbox_seal(struct fdbox *box)
> +{
> +	struct fdbox_fd *box_fd;
> +	unsigned long idx;
> +	int ret;
> +
> +	guard(rwsem_write)(&box->rwsem);
> +
> +	if (box->sealed)
> +		return -EBUSY;
> +
> +	xa_for_each(&box->fd_list, idx, box_fd) {
> +		const struct fdbox_file_ops *fdbox_ops = box_fd->file->f_fdbox_op;
> +
> +		if (fdbox_ops && fdbox_ops->seal) {
> +			ret = fdbox_ops->seal(box);
> +			if (ret)
> +				return ret;
> +		}
> +	}
> +
> +	box->sealed = true;
> +
> +	return 0;
> +}
> +
> +static long fdbox_unseal(struct fdbox *box)
> +{
> +	struct fdbox_fd *box_fd;
> +	unsigned long idx;
> +	int ret;
> +
> +	guard(rwsem_write)(&box->rwsem);
> +
> +	if (!box->sealed)
> +		return -EBUSY;
> +
> +	xa_for_each(&box->fd_list, idx, box_fd) {
> +		const struct fdbox_file_ops *fdbox_ops = box_fd->file->f_fdbox_op;
> +
> +		if (fdbox_ops && fdbox_ops->seal) {
> +			ret = fdbox_ops->seal(box);
> +			if (ret)
> +				return ret;
> +		}
> +	}
> +
> +	box->sealed = false;
> +
> +	return 0;
> +}
> +
> +static long fdbox_get_fd(struct fdbox *box, unsigned long arg)
> +{
> +	struct fdbox_get_fd get_fd;
> +	struct fdbox_fd *box_fd;
> +	int fd;
> +
> +	guard(rwsem_read)(&box->rwsem);
> +
> +	if (box->sealed)
> +		return -EBUSY;
> +
> +	if (copy_from_user(&get_fd, (void __user *)arg, sizeof(get_fd)))
> +		return -EFAULT;
> +
> +	if (get_fd.flags)
> +		return -EINVAL;
> +
> +	fd = get_unused_fd_flags(0);
> +	if (fd < 0)
> +		return fd;
> +
> +	box_fd = fdbox_remove_fd(box, get_fd.name);
> +	if (!box_fd) {
> +		put_unused_fd(fd);
> +		return -ENOENT;
> +	}
> +
> +	fd_install(fd, box_fd->file);
> +	kfree(box_fd);
> +	return fd;
> +}
> +
> +static long box_fops_unl_ioctl(struct file *filep,
> +			       unsigned int cmd, unsigned long arg)
> +{
> +	struct fdbox *box = filep->private_data;
> +	long ret = -EINVAL;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	switch (cmd) {
> +	case FDBOX_PUT_FD:
> +		ret = fdbox_put_fd(box, arg);
> +		break;
> +	case FDBOX_UNSEAL:
> +		ret = fdbox_unseal(box);
> +		break;
> +	case FDBOX_SEAL:
> +		ret = fdbox_seal(box);
> +		break;
> +	case FDBOX_GET_FD:
> +		ret = fdbox_get_fd(box, arg);
> +		break;

How does userspace know what file descriptors are in this fdbox if only
put and get are present? Userspace just remembers the names and
otherwise it simply leaks files that no one remembered?

> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +static int box_fops_open(struct inode *inode, struct file *filep)
> +{
> +	struct fdbox *box = container_of(inode->i_cdev, struct fdbox, cdev);
> +
> +	filep->private_data = box;
> +
> +	return 0;
> +}
> +
> +static const struct file_operations box_fops = {
> +	.owner		= THIS_MODULE,
> +	.unlocked_ioctl	= box_fops_unl_ioctl,
> +	.compat_ioctl	= compat_ptr_ioctl,
> +	.open		= box_fops_open,
> +};
> +
> +static void fdbox_device_release(struct device *dev)
> +{
> +	struct fdbox *box = container_of(dev, struct fdbox, dev);
> +	struct xarray *fdlist = &box->fd_list;
> +	struct fdbox_fd *box_fd;
> +	unsigned long idx;
> +
> +	unregister_chrdev_region(box->dev.devt, 1);
> +
> +	xa_for_each(fdlist, idx, box_fd) {
> +		xa_erase(fdlist, idx);
> +		fput(box_fd->file);
> +		kfree(box_fd);
> +	}
> +
> +	xa_destroy(fdlist);
> +	kfree(box);
> +}
> +
> +static struct fdbox *_fdbox_create_box(const char *name)
> +{
> +	struct fdbox *box;
> +	int ret = 0;
> +	u32 id;
> +
> +	box = kzalloc(sizeof(*box), GFP_KERNEL);
> +	if (!box)
> +		return ERR_PTR(-ENOMEM);
> +
> +	xa_init_flags(&box->fd_list, XA_FLAGS_ALLOC);
> +	xa_init_flags(&box->pending_fds, XA_FLAGS_ALLOC);
> +	init_rwsem(&box->rwsem);
> +
> +	if (strscpy_pad(box->name, name, sizeof(box->name)) < 0) {
> +		/* Name got truncated. This means the name is not NUL-terminated. */
> +		kfree(box);
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	dev_set_name(&box->dev, "fdbox/%s", name);
> +
> +	ret = alloc_chrdev_region(&box->dev.devt, 0, 1, name);
> +	if (ret) {
> +		kfree(box);
> +		return ERR_PTR(ret);
> +	}
> +
> +	box->dev.release = fdbox_device_release;
> +	device_initialize(&box->dev);
> +
> +	cdev_init(&box->cdev, &box_fops);
> +	box->cdev.owner = THIS_MODULE;
> +	kobject_set_name(&box->cdev.kobj, "fdbox/%s", name);
> +
> +	ret = cdev_device_add(&box->cdev, &box->dev);
> +	if (ret)
> +		goto err_dev;
> +
> +	ret = xa_alloc(&priv.box_list, &id, box, xa_limit_32b, GFP_KERNEL);
> +	if (ret)
> +		goto err_cdev;
> +
> +	return box;
> +
> +err_cdev:
> +	cdev_device_del(&box->cdev, &box->dev);
> +err_dev:
> +	/*
> +	 * This should free the box and chrdev region via
> +	 * fdbox_device_release().
> +	 */
> +	put_device(&box->dev);
> +
> +	return ERR_PTR(ret);
> +}
> +
> +static long fdbox_create_box(unsigned long arg)
> +{
> +	struct fdbox_create_box create_box;
> +
> +	if (copy_from_user(&create_box, (void __user *)arg, sizeof(create_box)))
> +		return -EFAULT;
> +
> +	if (create_box.flags)
> +		return -EINVAL;
> +
> +	return PTR_ERR_OR_ZERO(_fdbox_create_box(create_box.name));
> +}
> +
> +static void _fdbox_delete_box(struct fdbox *box)
> +{
> +	cdev_device_del(&box->cdev, &box->dev);
> +	unregister_chrdev_region(box->dev.devt, 1);
> +	put_device(&box->dev);
> +}
> +
> +static long fdbox_delete_box(unsigned long arg)
> +{
> +	struct fdbox_delete_box delete_box;
> +	struct fdbox *box;
> +
> +	if (copy_from_user(&delete_box, (void __user *)arg, sizeof(delete_box)))
> +		return -EFAULT;
> +
> +	if (delete_box.flags)
> +		return -EINVAL;
> +
> +	box = fdbox_remove_box(delete_box.name);
> +	if (!box)
> +		return -ENOENT;
> +
> +	_fdbox_delete_box(box);
> +	return 0;
> +}
> +
> +static long fdbox_fops_unl_ioctl(struct file *filep,
> +				 unsigned int cmd, unsigned long arg)
> +{
> +	long ret = -EINVAL;
> +
> +	switch (cmd) {
> +	case FDBOX_CREATE_BOX:
> +		ret = fdbox_create_box(arg);
> +		break;
> +	case FDBOX_DELETE_BOX:
> +		ret = fdbox_delete_box(arg);
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +static const struct file_operations fdbox_fops = {
> +	.owner		= THIS_MODULE,
> +	.unlocked_ioctl	= fdbox_fops_unl_ioctl,
> +	.compat_ioctl	= compat_ptr_ioctl,
> +};
> +
> +static struct miscdevice fdbox_dev = {
> +	.minor = FDBOX_MINOR,
> +	.name = "fdbox",
> +	.fops = &fdbox_fops,
> +	.nodename = "fdbox/fdbox",
> +	.mode = 0600,
> +};
> +
> +static char *fdbox_devnode(const struct device *dev, umode_t *mode)
> +{
> +	char *ret = kasprintf(GFP_KERNEL, "fdbox/%s", dev_name(dev));
> +	return ret;
> +}
> +
> +static int fdbox_kho_write_fds(void *fdt, struct fdbox *box)
> +{
> +	struct fdbox_fd *box_fd;
> +	struct file *file;
> +	unsigned long idx;
> +	int err = 0;
> +
> +	xa_for_each(&box->fd_list, idx, box_fd) {
> +		file = box_fd->file;
> +
> +		if (!file->f_fdbox_op->kho_write) {
> +			pr_info("box '%s' FD '%s' has no KHO method. It won't be saved across kexec\n",
> +				box->name, box_fd->name);
> +			continue;
> +		}
> +
> +		err = fdt_begin_node(fdt, box_fd->name);
> +		if (err) {
> +			pr_err("failed to begin node for box '%s' FD '%s'\n",
> +			       box->name, box_fd->name);
> +			return err;
> +		}
> +
> +		inode_lock(file_inode(file));
> +		err = file->f_fdbox_op->kho_write(box_fd, fdt);
> +		inode_unlock(file_inode(file));
> +		if (err) {
> +			pr_err("kho_write failed for box '%s' FD '%s': %d\n",
> +			       box->name, box_fd->name, err);
> +			return err;
> +		}
> +
> +		err = fdt_end_node(fdt);
> +		if (err) {
> +			/* TODO: This leaks all pages reserved by kho_write(). */
> +			pr_err("failed to end node for box '%s' FD '%s'\n",
> +			       box->name, box_fd->name);
> +			return err;
> +		}
> +	}
> +
> +	return err;
> +}
> +
> +static int fdbox_kho_write_boxes(void *fdt)
> +{
> +	static const char compatible[] = "fdbox,box-v1";
> +	struct fdbox *box;
> +	unsigned long idx;
> +	int err = 0;
> +
> +	xa_for_each(&priv.box_list, idx, box) {
> +		if (!box->sealed)
> +			continue;
> +
> +		err |= fdt_begin_node(fdt, box->name);
> +		err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
> +		err |= fdbox_kho_write_fds(fdt, box);
> +		err |= fdt_end_node(fdt);
> +	}
> +
> +	return err;
> +}
> +
> +static int fdbox_kho_notifier(struct notifier_block *self,
> +			      unsigned long cmd,
> +			      void *v)
> +{
> +	static const char compatible[] = "fdbox-v1";
> +	void *fdt = v;
> +	int err = 0;
> +
> +	switch (cmd) {
> +	case KEXEC_KHO_ABORT:
> +		return NOTIFY_DONE;
> +	case KEXEC_KHO_DUMP:
> +		/* Handled below */
> +		break;
> +	default:
> +		return NOTIFY_BAD;
> +	}
> +
> +	err |= fdt_begin_node(fdt, "fdbox");
> +	err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
> +	err |= fdbox_kho_write_boxes(fdt);
> +	err |= fdt_end_node(fdt);
> +
> +	return err ? NOTIFY_BAD : NOTIFY_DONE;
> +}
> +
> +static struct notifier_block fdbox_kho_nb = {
> +	.notifier_call = fdbox_kho_notifier,
> +};
> +
> +static void fdbox_recover_fd(const void *fdt, int offset, struct fdbox *box,
> +			     struct file *(*fn)(const void *fdt, int offset))
> +{
> +	struct fdbox_fd *box_fd;
> +	struct file *file;
> +	const char *name;
> +
> +	name = fdt_get_name(fdt, offset, NULL);
> +	if (!name) {
> +		pr_err("no name in FDT for FD at offset %d\n", offset);
> +		return;
> +	}
> +
> +	file = fn(fdt, offset);
> +	if (!file)
> +		return;
> +
> +	scoped_guard(rwsem_read, &box->rwsem) {
> +		box_fd = fdbox_put_file(box, name, file);
> +		if (IS_ERR(box_fd)) {
> +			pr_err("failed to put fd '%s' into box '%s': %ld\n",
> +			       box->name, name, PTR_ERR(box_fd));
> +			fput(file);
> +			return;
> +		}
> +	}
> +}
> +
> +static void fdbox_kho_recover(void)
> +{
> +	const void *fdt = kho_get_fdt();
> +	const char *path = "/fdbox";
> +	int off, box, fd;
> +	int err;
> +
> +	/* Not a KHO boot */
> +	if (!fdt)
> +		return;
> +
> +	/*
> +	 * When adding handlers this is taken as read. Taking it as write here
> +	 * ensures no handlers get added while nodes are being processed,
> +	 * eliminating the race of a handler getting added after its node is
> +	 * processed, but before the whole recover is done.
> +	 */
> +	guard(rwsem_write)(&priv.recover_sem);
> +
> +	off = fdt_path_offset(fdt, path);
> +	if (off < 0) {
> +		pr_debug("could not find '%s' in DT", path);
> +		return;
> +	}
> +
> +	err = fdt_node_check_compatible(fdt, off, "fdbox-v1");
> +	if (err) {
> +		pr_err("invalid top level compatible\n");
> +		return;
> +	}
> +
> +	fdt_for_each_subnode(box, fdt, off) {
> +		struct fdbox *new_box;
> +
> +		err = fdt_node_check_compatible(fdt, box, "fdbox,box-v1");
> +		if (err) {
> +			pr_err("invalid compatible for box '%s'\n",
> +			       fdt_get_name(fdt, box, NULL));
> +			continue;
> +		}
> +
> +		new_box = _fdbox_create_box(fdt_get_name(fdt, box, NULL));
> +		if (IS_ERR(new_box)) {
> +			pr_warn("could not create box '%s'\n",
> +				fdt_get_name(fdt, box, NULL));
> +			continue;
> +		}
> +
> +		fdt_for_each_subnode(fd, fdt, box) {
> +			struct fdbox_handler *handler;
> +			const char *compatible;
> +			unsigned long idx;
> +
> +			compatible = fdt_getprop(fdt, fd, "compatible", NULL);
> +			if (!compatible) {
> +				pr_warn("failed to get compatible for FD '%s'. Skipping.\n",
> +					fdt_get_name(fdt, fd, NULL));
> +				continue;
> +			}
> +
> +			xa_for_each(&priv.handlers, idx, handler) {
> +				if (!strcmp(handler->compatible, compatible))
> +					break;
> +			}
> +
> +			if (handler) {
> +				fdbox_recover_fd(fdt, fd, new_box, handler->fn);
> +			} else {
> +				u32 id;
> +
> +				pr_debug("found no handler for compatible %s. Queueing for later.\n",
> +					 compatible);
> +
> +				if (xa_alloc(&new_box->pending_fds, &id,
> +					     xa_mk_value(fd), xa_limit_32b,
> +					     GFP_KERNEL)) {
> +					pr_warn("failed to queue pending FD '%s' to list\n",
> +						fdt_get_name(fdt, fd, NULL));
> +				}
> +			}
> +		}
> +
> +		new_box->sealed = true;
> +	}
> +
> +	priv.recover_done = true;
> +}
> +
> +static void fdbox_recover_pending(struct fdbox_handler *handler)
> +{
> +	const void *fdt = kho_get_fdt();
> +	unsigned long bid, pid;
> +	struct fdbox *box;
> +	void *pending;
> +
> +	if (WARN_ON(!fdt))
> +		return;
> +
> +	xa_for_each(&priv.box_list, bid, box) {
> +		xa_for_each(&box->pending_fds, pid, pending) {
> +			int off = xa_to_value(pending);
> +
> +			if (fdt_node_check_compatible(fdt, off, handler->compatible) == 0) {
> +				fdbox_recover_fd(fdt, off, box, handler->fn);
> +				xa_erase(&box->pending_fds, pid);
> +			}
> +		}
> +	}
> +}
> +
> +int fdbox_register_handler(const char *compatible,
> +			   struct file *(*fn)(const void *fdt, int offset))
> +{
> +	struct xarray *handlers = &priv.handlers;
> +	struct fdbox_handler *handler, *cmp;
> +	unsigned long idx;
> +	int ret;
> +	u32 id;
> +
> +	/* See comment in fdbox_kho_recover(). */
> +	guard(rwsem_read)(&priv.recover_sem);
> +
> +	handler = kmalloc(sizeof(*handler), GFP_KERNEL);
> +	if (!handler)
> +		return -ENOMEM;
> +
> +	handler->compatible = compatible;
> +	handler->fn = fn;
> +
> +	xa_lock(handlers);
> +	xa_for_each(handlers, idx, cmp) {
> +		if (!strcmp(cmp->compatible, compatible)) {
> +			xa_unlock(handlers);
> +			kfree(handler);
> +			return -EEXIST;
> +		}
> +	}
> +
> +	ret = __xa_alloc(handlers, &id, handler, xa_limit_32b, GFP_KERNEL);
> +	xa_unlock(handlers);
> +	if (ret) {
> +		kfree(handler);
> +		return ret;
> +	}
> +
> +	if (priv.recover_done)
> +		fdbox_recover_pending(handler);
> +
> +	return 0;
> +}
> +
> +static int __init fdbox_init(void)
> +{
> +	int ret = 0;
> +
> +	/* /dev/fdbox/$NAME */
> +	priv.class = class_create("fdbox");
> +	if (IS_ERR(priv.class))
> +		return PTR_ERR(priv.class);
> +
> +	priv.class->devnode = fdbox_devnode;
> +
> +	ret = alloc_chrdev_region(&priv.box_devt, 0, 1, "fdbox");
> +	if (ret)
> +		goto err_class;
> +
> +	ret = misc_register(&fdbox_dev);
> +	if (ret) {
> +		pr_err("fdbox: misc device register failed\n");
> +		goto err_chrdev;
> +	}
> +
> +	if (IS_ENABLED(CONFIG_KEXEC_HANDOVER)) {
> +		register_kho_notifier(&fdbox_kho_nb);
> +		fdbox_kho_recover();
> +	}
> +
> +	return 0;
> +
> +err_chrdev:
> +	unregister_chrdev_region(priv.box_devt, 1);
> +	priv.box_devt = 0;
> +err_class:
> +	class_destroy(priv.class);
> +	priv.class = NULL;
> +	return ret;
> +}
> +module_init(fdbox_init);
> diff --git a/include/linux/fdbox.h b/include/linux/fdbox.h
> new file mode 100644
> index 0000000000000..0bc18742940f5
> --- /dev/null
> +++ b/include/linux/fdbox.h
> @@ -0,0 +1,119 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright (C) 2024-2025 Amazon.com Inc. or its affiliates.
> + *
> + * Author: Pratyush Yadav <ptyadav@amazon.de>
> + * Author: Alexander Graf <graf@amazon.com>
> + */
> +#ifndef _LINUX_FDBOX_H
> +#define _LINUX_FDBOX_H
> +
> +#include <linux/cdev.h>
> +#include <linux/device.h>
> +#include <linux/fs.h>
> +#include <linux/list.h>
> +#include <linux/mutex.h>
> +#include <linux/types.h>
> +#include <uapi/linux/fdbox.h>
> +
> +/**
> + * struct fdbox - A box of FDs.
> + * @name: Name of the box. Must be unique.
> + * @rwsem: Used to ensure exclusive access to the box during SEAL/UNSEAL
> + *         operations.
> + * @dev: Backing device for the character device.
> + * @cdev: Character device which accepts ioctls from userspace.
> + * @fd_list: List of FDs in the box.
> + * @sealed: Whether the box is sealed or not.
> + */
> +struct fdbox {
> +	char				name[FDBOX_NAME_LEN];
> +	/*
> +	 * Taken as read when non-exclusive access is needed and the box can be
> +	 * in mutable state. For example, the GET_FD and PUT_FD operations use
> +	 * it as read when adding or removing FDs from the box.
> +	 *
> +	 * Taken as write when exclusive access is needed and the box should be
> +	 * in a stable, non-mutable state. For example, the SEAL and UNSEAL
> +	 * operations use it as write because they need the list of FDs to be
> +	 * stable.
> +	 */
> +	struct rw_semaphore		rwsem;
> +	struct device			dev;
> +	struct cdev			cdev;
> +	struct xarray			fd_list;
> +	struct xarray			pending_fds;
> +	bool				sealed;
> +};
> +
> +/**
> + * struct fdbox_fd - An FD in a box.
> + * @name: Name of the FD. Must be unique in the box.
> + * @file: Underlying file for the FD.
> + * @flags: Box flags. Currently, no flags are allowed.
> + * @box: The box to which this FD belongs.
> + */
> +struct fdbox_fd {
> +	char				name[FDBOX_NAME_LEN];
> +	struct file			*file;
> +	int				flags;
> +	struct fdbox			*box;
> +};
> +
> +/**
> + * struct fdbox_file_ops - operations for files that can be put into a fdbox.
> + */
> +struct fdbox_file_ops {
> +	/**
> +	 * @kho_write: write fd to KHO FDT.
> +	 *
> +	 * box_fd: Box FD to be serialized.
> +	 *
> +	 * fdt: KHO FDT
> +	 *
> +	 * This is called during KHO activation phase to serialize all data
> +	 * needed for a FD to be preserved across a KHO.
> +	 *
> +	 * Returns: 0 on success, -errno on failure. Error here causes KHO
> +	 * activation failure.
> +	 */
> +	int (*kho_write)(struct fdbox_fd *box_fd, void *fdt);
> +	/**
> +	 * @seal: seal the box
> +	 *
> +	 * box: Box which is going to be sealed.
> +	 *
> +	 * This can be set if a file has a dependency on other files. At seal
> +	 * time, all the FDs in the box can be inspected to ensure all the
> +	 * dependencies are met.
> +	 */
> +	int (*seal)(struct fdbox *box);
> +	/**
> +	 * @unseal: unseal the box
> +	 *
> +	 * box: Box which is going to be sealed.
> +	 *
> +	 * The opposite of seal. This can be set if a file has a dependency on
> +	 * other files. At unseal time, all the FDs in the box can be inspected
> +	 * to ensure all the dependencies are met. This can help ensure all
> +	 * necessary FDs made it through after a KHO for example.
> +	 */
> +	int (*unseal)(struct fdbox *box);
> +};
> +
> +/**
> + * fdbox_register_handler - register a handler for recovering Box FDs after KHO.
> + * @compatible: compatible string in the KHO FDT node.
> + * @handler: function to parse the FDT at offset 'offset'.
> + *
> + * After KHO, the FDs in the KHO FDT must be deserialized by the underlying
> + * modules or file systems. Since module initialization can be in any order,
> + * including after FDBox has been initialized, handler registration allows
> + * modules to queue their parsing functions, and FDBox will execute them when it
> + * can.
> + *
> + * Returns: 0 on success, -errno otherwise.
> + */
> +int fdbox_register_handler(const char *compatible,
> +			   struct file *(*handler)(const void *fdt, int offset));
> +#endif /* _LINUX_FDBOX_H */
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index be3ad155ec9f7..7d710a5e09b5b 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -81,6 +81,9 @@ struct fs_context;
>  struct fs_parameter_spec;
>  struct fileattr;
>  struct iomap_ops;
> +struct fdbox;
> +struct fdbox_fd;
> +struct fdbox_file_ops;
>  
>  extern void __init inode_init(void);
>  extern void __init inode_init_early(void);
> @@ -1078,6 +1081,7 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
>   * @f_llist: work queue entrypoint
>   * @f_ra: file's readahead state
>   * @f_freeptr: Pointer used by SLAB_TYPESAFE_BY_RCU file cache (don't touch.)
> + * @f_fdbox_op: FDBOX operations
>   */
>  struct file {
>  	file_ref_t			f_ref;
> @@ -1116,6 +1120,9 @@ struct file {
>  		freeptr_t		f_freeptr;
>  	};
>  	/* --- cacheline 3 boundary (192 bytes) --- */
> +#ifdef CONFIG_FDBOX
> +	const struct fdbox_file_ops	*f_fdbox_op;
> +#endif

I've shrunk struct file significantly over the last kernel releases
removing useless member and gave it a new refcount mechanism to increase
performance.

This is now taking 8 free bytes for this a very special-purpose thing.
Those 8 bytes should be used for stuff that is widely useful. So that
already makes me go "no-way".

struct file shouldn't have to know about any of this stuff. Just
recognize the type of file by its struct file_operations or a lot of
other ways and then infer the fdbox ops from that.

>  } __randomize_layout
>    __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */
>  
> diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
> index 69e110c2b86a9..fedb873c04453 100644
> --- a/include/linux/miscdevice.h
> +++ b/include/linux/miscdevice.h
> @@ -71,6 +71,7 @@
>  #define USERIO_MINOR		240
>  #define VHOST_VSOCK_MINOR	241
>  #define RFKILL_MINOR		242
> +#define FDBOX_MINOR		243
>  #define MISC_DYNAMIC_MINOR	255
>  
>  struct device;
> diff --git a/include/uapi/linux/fdbox.h b/include/uapi/linux/fdbox.h
> new file mode 100644
> index 0000000000000..577ba33b908fd
> --- /dev/null
> +++ b/include/uapi/linux/fdbox.h
> @@ -0,0 +1,61 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * This file contains definitions and structures for fdbox ioctls.
> + *
> + * Copyright (C) 2024-2025 Amazon.com Inc. or its affiliates.
> + *
> + * Author: Pratyush Yadav <ptyadav@amazon.de>
> + * Author: Alexander Graf <graf@amazon.com>
> + */
> +#ifndef _UAPI_LINUX_FDBOX_H
> +#define _UAPI_LINUX_FDBOX_H
> +
> +#include <linux/types.h>
> +#include <linux/ioctl.h>
> +
> +#define FDBOX_NAME_LEN			256
> +
> +#define FDBOX_TYPE	('.')
> +#define FDBOX_BASE	0
> +
> +/* Ioctls on /dev/fdbox/fdbox */
> +
> +/* Create a box. */
> +#define FDBOX_CREATE_BOX	_IO(FDBOX_TYPE, FDBOX_BASE + 0)
> +struct fdbox_create_box {
> +	__u64 flags;
> +	__u8 name[FDBOX_NAME_LEN];
> +};
> +
> +/* Delete a box. */
> +#define FDBOX_DELETE_BOX	_IO(FDBOX_TYPE, FDBOX_BASE + 1)
> +struct fdbox_delete_box {
> +	__u64 flags;
> +	__u8 name[FDBOX_NAME_LEN];
> +};
> +
> +/* Ioctls on /dev/fdbox/$BOXNAME */
> +
> +/* Put FD into box. This unmaps the FD from the calling process. */
> +#define FDBOX_PUT_FD	_IO(FDBOX_TYPE, FDBOX_BASE + 2)
> +struct fdbox_put_fd {
> +	__u64 flags;
> +	__u32 fd;
> +	__u32 pad;
> +	__u8 name[FDBOX_NAME_LEN];
> +};
> +
> +/* Get the FD from box. This maps the FD into the calling process. */
> +#define FDBOX_GET_FD	_IO(FDBOX_TYPE, FDBOX_BASE + 3)
> +struct fdbox_get_fd {
> +	__u64 flags;
> +	__u32 pad;
> +	__u8 name[FDBOX_NAME_LEN];
> +};
> +
> +/* Seal the box. After this, no FDs can be put in or taken out of the box. */
> +#define FDBOX_SEAL	_IO(FDBOX_TYPE, FDBOX_BASE + 4)
> +/* Unseal the box. Opposite of seal. */
> +#define FDBOX_UNSEAL	_IO(FDBOX_TYPE, FDBOX_BASE + 5)
> +
> +#endif /* _UAPI_LINUX_FDBOX_H */
> -- 
> 2.47.1
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 1/5] misc: introduce FDBox
  2025-03-07  9:31   ` Christian Brauner
@ 2025-03-07 13:19     ` Christian Brauner
  2025-03-07 15:14     ` Jason Gunthorpe
  2025-03-08  0:10     ` Pratyush Yadav
  2 siblings, 0 replies; 29+ messages in thread
From: Christian Brauner @ 2025-03-07 13:19 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Linus Torvalds, linux-kernel, Jonathan Corbet, Eric Biederman,
	Arnd Bergmann, Greg Kroah-Hartman, Alexander Viro, Jan Kara,
	Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt,
	David Woodhouse, James Gowans, Mike Rapoport, Paolo Bonzini,
	Pasha Tatashin, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Jason Gunthorpe, Matthew Wilcox, Wei Yang, Andrew Morton,
	linux-fsdevel, linux-doc, linux-mm, kexec

On Fri, Mar 07, 2025 at 10:31:39AM +0100, Christian Brauner wrote:
> On Fri, Mar 07, 2025 at 12:57:35AM +0000, Pratyush Yadav wrote:
> > The File Descriptor Box (FDBox) is a mechanism for userspace to name
> > file descriptors and give them over to the kernel to hold. They can
> > later be retrieved by passing in the same name.
> > 
> > The primary purpose of FDBox is to be used with Kexec Handover (KHO).
> > There are many kinds anonymous file descriptors in the kernel like
> > memfd, guest_memfd, iommufd, etc. that would be useful to be preserved
> > using KHO. To be able to do that, there needs to be a mechanism to label
> > FDs that allows userspace to set the label before doing KHO and to use
> > the label to map them back after KHO. FDBox achieves that purpose by
> > exposing a miscdevice which exposes ioctls to label and transfer FDs
> > between the kernel and userspace. FDBox is not intended to work with any
> > generic file descriptor. Support for each kind of FDs must be explicitly
> > enabled.
> 
> This makes no sense as a generic concept. If you want to restore shmem
> and possibly anonymous inodes files via KHO then tailor the solution to
> shmem and anon inodes but don't make this generic infrastructure. This
> has zero chances to cover generic files.
> 
> As soon as you're dealing with non-kernel internal mounts that are not
> guaranteed to always be there or something that depends on superblock or
> mount specific information that can change you're already screwed. This
> will end up a giant mess. This is not supportable or maintainable.
> 
> And struct file should have zero to do with this KHO stuff. It doesn't
> need to carry new operations and it doesn't need to waste precious space
> for any of this.
> 
> > 
> > While the primary purpose of FDBox is to be used with KHO, it does not
> > explicitly require CONFIG_KEXEC_HANDOVER, since it can be used without
> > KHO, simply as a way to preserve or transfer FDs when userspace exits.
> 
> This use-case is covered with systemd's fdstore and it's available to
> unprivileged userspace. Stashing arbitrary file descriptors in the
> kernel in this way isn't a good idea.
> 
> > 
> > Co-developed-by: Alexander Graf <graf@amazon.com>
> > Signed-off-by: Alexander Graf <graf@amazon.com>
> > Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
> > ---
> > 
> > Notes:
> >     In a real live-update environment, it would likely make more sense to
> >     have a way of passing a hint to the kernel that KHO is about to happen
> >     and it should start preparing. Having as much state serialized as
> >     possible before the KHO freeze would help reduce downtime. An FDBox
> >     operation, say FDBOX_PREPARE_FD that can give the signal to prepare
> >     before actually being put in the box and sealed. I have not added
> >     something like that yet for simplicity sake.
> >
> >  MAINTAINERS                |   8 +
> >  drivers/misc/Kconfig       |   7 +
> >  drivers/misc/Makefile      |   1 +
> >  drivers/misc/fdbox.c       | 758 +++++++++++++++++++++++++++++++++++++
> >  include/linux/fdbox.h      | 119 ++++++
> >  include/linux/fs.h         |   7 +
> >  include/linux/miscdevice.h |   1 +
> >  include/uapi/linux/fdbox.h |  61 +++
> >  8 files changed, 962 insertions(+)
> >  create mode 100644 drivers/misc/fdbox.c
> >  create mode 100644 include/linux/fdbox.h
> >  create mode 100644 include/uapi/linux/fdbox.h
> > 
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 82c2ef421c000..d329d3e5514c5 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -8862,6 +8862,14 @@ F:	include/scsi/libfc.h
> >  F:	include/scsi/libfcoe.h
> >  F:	include/uapi/scsi/fc/
> >  
> > +FDBOX
> > +M:	Pratyush Yadav <pratyush@kernel.org>
> > +L:	linux-fsdevel@vger.kernel.org
> > +S:	Maintained
> > +F:	drivers/misc/fdbox.c
> > +F:	include/linux/fdbox.h
> > +F:	include/uapi/linux/fdbox.h
> > +
> >  FILE LOCKING (flock() and fcntl()/lockf())
> >  M:	Jeff Layton <jlayton@kernel.org>
> >  M:	Chuck Lever <chuck.lever@oracle.com>
> > diff --git a/drivers/misc/Kconfig b/drivers/misc/Kconfig
> > index 56bc72c7ce4a9..6fee70c9479c4 100644
> > --- a/drivers/misc/Kconfig
> > +++ b/drivers/misc/Kconfig
> > @@ -632,6 +632,13 @@ config MCHP_LAN966X_PCI
> >  	    - lan966x-miim (MDIO_MSCC_MIIM)
> >  	    - lan966x-switch (LAN966X_SWITCH)
> >  
> > +config FDBOX
> > +	bool "File Descriptor Box device to persist fds"
> > +	help
> > +	  Add a new /dev/fdbox directory that allows user space to preserve specific
> > +	  types of file descritors when user space exits. Also preserves the same
> > +	  types of file descriptors across kexec when KHO is enabled.
> > +
> >  source "drivers/misc/c2port/Kconfig"
> >  source "drivers/misc/eeprom/Kconfig"
> >  source "drivers/misc/cb710/Kconfig"
> > diff --git a/drivers/misc/Makefile b/drivers/misc/Makefile
> > index 545aad06d0885..59a398dcfcd64 100644
> > --- a/drivers/misc/Makefile
> > +++ b/drivers/misc/Makefile
> > @@ -75,3 +75,4 @@ lan966x-pci-objs		:= lan966x_pci.o
> >  lan966x-pci-objs		+= lan966x_pci.dtbo.o
> >  obj-$(CONFIG_MCHP_LAN966X_PCI)	+= lan966x-pci.o
> >  obj-y				+= keba/
> > +obj-$(CONFIG_FDBOX)		+= fdbox.o
> > diff --git a/drivers/misc/fdbox.c b/drivers/misc/fdbox.c
> > new file mode 100644
> > index 0000000000000..a8f6574e2c25f
> > --- /dev/null
> > +++ b/drivers/misc/fdbox.c
> > @@ -0,0 +1,758 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * fdbox.c - framework to preserve file descriptors across
> > + *           process lifetime and kexec
> > + *
> > + * Copyright (C) 2024-2025 Amazon.com Inc. or its affiliates.
> > + *
> > + * Author: Pratyush Yadav <ptyadav@amazon.de>
> > + * Author: Alexander Graf <graf@amazon.com>
> > + */
> > +
> > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > +
> > +#include <linux/device.h>
> > +#include <linux/anon_inodes.h>
> > +#include <linux/cdev.h>
> > +#include <linux/miscdevice.h>
> > +#include <linux/fdtable.h>
> > +#include <linux/file.h>
> > +#include <linux/kexec.h>
> > +#include <linux/kexec_handover.h>
> > +#include <linux/libfdt.h>
> > +#include <linux/fdbox.h>
> > +
> > +static struct miscdevice fdbox_dev;
> > +
> > +static struct {
> > +	struct class			*class;
> > +	dev_t				box_devt;
> > +	struct xarray			box_list;
> > +	struct xarray			handlers;
> > +	struct rw_semaphore		recover_sem;
> > +	bool				recover_done;
> > +} priv = {
> > +	.box_list = XARRAY_INIT(fdbox.box_list, XA_FLAGS_ALLOC),
> > +	.handlers = XARRAY_INIT(fdbox.handlers, XA_FLAGS_ALLOC),
> > +	.recover_sem = __RWSEM_INITIALIZER(priv.recover_sem),
> > +};
> > +
> > +struct fdbox_handler {
> > +	const char *compatible;
> > +	struct file *(*fn)(const void *fdt, int offset);
> > +};
> > +
> > +static struct fdbox *fdbox_remove_box(char *name)
> > +{
> > +	struct xarray *boxlist = &priv.box_list;
> > +	unsigned long box_idx;
> > +	struct fdbox *box;
> > +
> > +	xa_lock(boxlist);
> > +	xa_for_each(boxlist, box_idx, box) {
> > +		if (!strcmp(box->name, name)) {
> > +			__xa_erase(boxlist, box_idx);
> > +			break;
> > +		}
> > +	}
> > +	xa_unlock(boxlist);
> > +
> > +	return box;
> > +}
> > +
> > +static struct fdbox_fd *fdbox_remove_fd(struct fdbox *box, char *name)
> > +{
> > +	struct xarray *fdlist = &box->fd_list;
> > +	struct fdbox_fd *box_fd;
> > +	unsigned long idx;
> > +
> > +	xa_lock(fdlist);
> > +	xa_for_each(fdlist, idx, box_fd) {
> > +		if (!strncmp(box_fd->name, name, sizeof(box_fd->name))) {
> > +			__xa_erase(fdlist, idx);
> > +			break;
> > +		}
> > +	}
> > +	xa_unlock(fdlist);
> > +
> > +	return box_fd;
> > +}
> > +
> > +/* Must be called with box->rwsem held. */
> > +static struct fdbox_fd *fdbox_put_file(struct fdbox *box, const char *name,
> > +				       struct file *file)
> > +{
> > +	struct fdbox_fd *box_fd __free(kfree) = NULL, *cmp;
> > +	struct xarray *fdlist = &box->fd_list;
> > +	unsigned long idx;
> > +	u32 newid;
> > +	int ret;
> > +
> > +	/* Only files that set f_fdbox_op are allowed in the box. */
> > +	if (!file->f_fdbox_op)
> > +		return ERR_PTR(-EOPNOTSUPP);
> > +
> > +	box_fd = kzalloc(sizeof(*box_fd), GFP_KERNEL);
> > +	if (!box_fd)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	if (strscpy_pad(box_fd->name, name, sizeof(box_fd->name)) < 0)
> > +		/* Name got truncated. This means the name is not NUL-terminated. */
> > +		return ERR_PTR(-EINVAL);
> > +
> > +	box_fd->file = file;
> > +	box_fd->box = box;
> > +
> > +	xa_lock(fdlist);
> > +	xa_for_each(fdlist, idx, cmp) {
> > +		/* Look for name collisions. */
> > +		if (!strcmp(box_fd->name, cmp->name)) {
> > +			xa_unlock(fdlist);
> > +			return ERR_PTR(-EEXIST);
> > +		}
> > +	}
> > +
> > +	ret = __xa_alloc(fdlist, &newid, box_fd, xa_limit_32b, GFP_KERNEL);
> > +	xa_unlock(fdlist);
> > +	if (ret)
> > +		return ERR_PTR(ret);
> > +
> > +	return_ptr(box_fd);
> > +}
> > +
> > +static long fdbox_put_fd(struct fdbox *box, unsigned long arg)
> > +{
> > +	struct fdbox_put_fd put_fd;
> > +	struct fdbox_fd *box_fd;
> > +	struct file *file;
> > +	int ret;
> > +
> > +	if (copy_from_user(&put_fd, (void __user *)arg, sizeof(put_fd)))
> > +		return -EFAULT;
> > +
> > +	guard(rwsem_read)(&box->rwsem);
> > +
> > +	if (box->sealed)
> > +		return -EBUSY;
> > +
> > +	file = fget_raw(put_fd.fd);
> > +	if (!file)
> > +		return -EINVAL;
> > +
> > +	box_fd = fdbox_put_file(box, put_fd.name, file);
> > +	if (IS_ERR(box_fd)) {
> > +		fput(file);
> > +		return PTR_ERR(box_fd);
> > +	}
> > +
> > +	ret = close_fd(put_fd.fd);
> > +	if (ret) {
> > +		struct fdbox_fd *del;
> > +
> > +		del = fdbox_remove_fd(box, put_fd.name);
> > +		/*
> > +		 * If we fail to remove from list, it means someone else took
> > +		 * the FD out. In that case, they own the refcount of the file
> > +		 * now.
> > +		 */
> > +		if (del == box_fd)
> > +			fput(file);
> 
> This is a racy mess. Why would adding a file to an fdbox be coupled with
> closing it concpetually? The caller should close the file descriptor
> itself and not do this close_fd() here in the kernel.
> 
> > +
> > +		return ret;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static long fdbox_seal(struct fdbox *box)
> > +{
> > +	struct fdbox_fd *box_fd;
> > +	unsigned long idx;
> > +	int ret;
> > +
> > +	guard(rwsem_write)(&box->rwsem);
> > +
> > +	if (box->sealed)
> > +		return -EBUSY;
> > +
> > +	xa_for_each(&box->fd_list, idx, box_fd) {
> > +		const struct fdbox_file_ops *fdbox_ops = box_fd->file->f_fdbox_op;
> > +
> > +		if (fdbox_ops && fdbox_ops->seal) {
> > +			ret = fdbox_ops->seal(box);
> > +			if (ret)
> > +				return ret;
> > +		}
> > +	}
> > +
> > +	box->sealed = true;
> > +
> > +	return 0;
> > +}
> > +
> > +static long fdbox_unseal(struct fdbox *box)
> > +{
> > +	struct fdbox_fd *box_fd;
> > +	unsigned long idx;
> > +	int ret;
> > +
> > +	guard(rwsem_write)(&box->rwsem);
> > +
> > +	if (!box->sealed)
> > +		return -EBUSY;
> > +
> > +	xa_for_each(&box->fd_list, idx, box_fd) {
> > +		const struct fdbox_file_ops *fdbox_ops = box_fd->file->f_fdbox_op;
> > +
> > +		if (fdbox_ops && fdbox_ops->seal) {
> > +			ret = fdbox_ops->seal(box);
> > +			if (ret)
> > +				return ret;
> > +		}
> > +	}
> > +
> > +	box->sealed = false;
> > +
> > +	return 0;
> > +}
> > +
> > +static long fdbox_get_fd(struct fdbox *box, unsigned long arg)
> > +{
> > +	struct fdbox_get_fd get_fd;
> > +	struct fdbox_fd *box_fd;
> > +	int fd;
> > +
> > +	guard(rwsem_read)(&box->rwsem);
> > +
> > +	if (box->sealed)
> > +		return -EBUSY;
> > +
> > +	if (copy_from_user(&get_fd, (void __user *)arg, sizeof(get_fd)))
> > +		return -EFAULT;
> > +
> > +	if (get_fd.flags)
> > +		return -EINVAL;
> > +
> > +	fd = get_unused_fd_flags(0);
> > +	if (fd < 0)
> > +		return fd;
> > +
> > +	box_fd = fdbox_remove_fd(box, get_fd.name);
> > +	if (!box_fd) {
> > +		put_unused_fd(fd);
> > +		return -ENOENT;
> > +	}
> > +
> > +	fd_install(fd, box_fd->file);
> > +	kfree(box_fd);
> > +	return fd;
> > +}
> > +
> > +static long box_fops_unl_ioctl(struct file *filep,
> > +			       unsigned int cmd, unsigned long arg)
> > +{
> > +	struct fdbox *box = filep->private_data;
> > +	long ret = -EINVAL;
> > +
> > +	if (!capable(CAP_SYS_ADMIN))
> > +		return -EPERM;
> > +
> > +	switch (cmd) {
> > +	case FDBOX_PUT_FD:
> > +		ret = fdbox_put_fd(box, arg);
> > +		break;
> > +	case FDBOX_UNSEAL:
> > +		ret = fdbox_unseal(box);
> > +		break;
> > +	case FDBOX_SEAL:
> > +		ret = fdbox_seal(box);
> > +		break;
> > +	case FDBOX_GET_FD:
> > +		ret = fdbox_get_fd(box, arg);
> > +		break;
> 
> How does userspace know what file descriptors are in this fdbox if only
> put and get are present? Userspace just remembers the names and
> otherwise it simply leaks files that no one remembered?
> 
> > +	default:
> > +		ret = -EINVAL;
> > +		break;
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +static int box_fops_open(struct inode *inode, struct file *filep)
> > +{
> > +	struct fdbox *box = container_of(inode->i_cdev, struct fdbox, cdev);
> > +
> > +	filep->private_data = box;
> > +
> > +	return 0;
> > +}
> > +
> > +static const struct file_operations box_fops = {
> > +	.owner		= THIS_MODULE,
> > +	.unlocked_ioctl	= box_fops_unl_ioctl,
> > +	.compat_ioctl	= compat_ptr_ioctl,
> > +	.open		= box_fops_open,
> > +};
> > +
> > +static void fdbox_device_release(struct device *dev)
> > +{
> > +	struct fdbox *box = container_of(dev, struct fdbox, dev);
> > +	struct xarray *fdlist = &box->fd_list;
> > +	struct fdbox_fd *box_fd;
> > +	unsigned long idx;
> > +
> > +	unregister_chrdev_region(box->dev.devt, 1);
> > +
> > +	xa_for_each(fdlist, idx, box_fd) {
> > +		xa_erase(fdlist, idx);
> > +		fput(box_fd->file);
> > +		kfree(box_fd);
> > +	}
> > +
> > +	xa_destroy(fdlist);
> > +	kfree(box);
> > +}
> > +
> > +static struct fdbox *_fdbox_create_box(const char *name)
> > +{
> > +	struct fdbox *box;
> > +	int ret = 0;
> > +	u32 id;
> > +
> > +	box = kzalloc(sizeof(*box), GFP_KERNEL);
> > +	if (!box)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	xa_init_flags(&box->fd_list, XA_FLAGS_ALLOC);
> > +	xa_init_flags(&box->pending_fds, XA_FLAGS_ALLOC);
> > +	init_rwsem(&box->rwsem);
> > +
> > +	if (strscpy_pad(box->name, name, sizeof(box->name)) < 0) {
> > +		/* Name got truncated. This means the name is not NUL-terminated. */
> > +		kfree(box);
> > +		return ERR_PTR(-EINVAL);
> > +	}
> > +
> > +	dev_set_name(&box->dev, "fdbox/%s", name);
> > +
> > +	ret = alloc_chrdev_region(&box->dev.devt, 0, 1, name);
> > +	if (ret) {
> > +		kfree(box);
> > +		return ERR_PTR(ret);
> > +	}
> > +
> > +	box->dev.release = fdbox_device_release;
> > +	device_initialize(&box->dev);
> > +
> > +	cdev_init(&box->cdev, &box_fops);
> > +	box->cdev.owner = THIS_MODULE;
> > +	kobject_set_name(&box->cdev.kobj, "fdbox/%s", name);
> > +
> > +	ret = cdev_device_add(&box->cdev, &box->dev);
> > +	if (ret)
> > +		goto err_dev;
> > +
> > +	ret = xa_alloc(&priv.box_list, &id, box, xa_limit_32b, GFP_KERNEL);
> > +	if (ret)
> > +		goto err_cdev;
> > +
> > +	return box;
> > +
> > +err_cdev:
> > +	cdev_device_del(&box->cdev, &box->dev);
> > +err_dev:
> > +	/*
> > +	 * This should free the box and chrdev region via
> > +	 * fdbox_device_release().
> > +	 */
> > +	put_device(&box->dev);
> > +
> > +	return ERR_PTR(ret);
> > +}
> > +
> > +static long fdbox_create_box(unsigned long arg)
> > +{
> > +	struct fdbox_create_box create_box;
> > +
> > +	if (copy_from_user(&create_box, (void __user *)arg, sizeof(create_box)))
> > +		return -EFAULT;
> > +
> > +	if (create_box.flags)
> > +		return -EINVAL;
> > +
> > +	return PTR_ERR_OR_ZERO(_fdbox_create_box(create_box.name));
> > +}
> > +
> > +static void _fdbox_delete_box(struct fdbox *box)
> > +{
> > +	cdev_device_del(&box->cdev, &box->dev);
> > +	unregister_chrdev_region(box->dev.devt, 1);
> > +	put_device(&box->dev);
> > +}
> > +
> > +static long fdbox_delete_box(unsigned long arg)
> > +{
> > +	struct fdbox_delete_box delete_box;
> > +	struct fdbox *box;
> > +
> > +	if (copy_from_user(&delete_box, (void __user *)arg, sizeof(delete_box)))
> > +		return -EFAULT;
> > +
> > +	if (delete_box.flags)
> > +		return -EINVAL;
> > +
> > +	box = fdbox_remove_box(delete_box.name);
> > +	if (!box)
> > +		return -ENOENT;
> > +
> > +	_fdbox_delete_box(box);
> > +	return 0;
> > +}
> > +
> > +static long fdbox_fops_unl_ioctl(struct file *filep,
> > +				 unsigned int cmd, unsigned long arg)
> > +{
> > +	long ret = -EINVAL;
> > +
> > +	switch (cmd) {
> > +	case FDBOX_CREATE_BOX:
> > +		ret = fdbox_create_box(arg);
> > +		break;
> > +	case FDBOX_DELETE_BOX:
> > +		ret = fdbox_delete_box(arg);
> > +		break;
> > +	}
> > +
> > +	return ret;
> > +}
> > +
> > +static const struct file_operations fdbox_fops = {
> > +	.owner		= THIS_MODULE,
> > +	.unlocked_ioctl	= fdbox_fops_unl_ioctl,
> > +	.compat_ioctl	= compat_ptr_ioctl,
> > +};
> > +
> > +static struct miscdevice fdbox_dev = {
> > +	.minor = FDBOX_MINOR,
> > +	.name = "fdbox",
> > +	.fops = &fdbox_fops,
> > +	.nodename = "fdbox/fdbox",
> > +	.mode = 0600,
> > +};
> > +
> > +static char *fdbox_devnode(const struct device *dev, umode_t *mode)
> > +{
> > +	char *ret = kasprintf(GFP_KERNEL, "fdbox/%s", dev_name(dev));
> > +	return ret;
> > +}
> > +
> > +static int fdbox_kho_write_fds(void *fdt, struct fdbox *box)
> > +{
> > +	struct fdbox_fd *box_fd;
> > +	struct file *file;
> > +	unsigned long idx;
> > +	int err = 0;
> > +
> > +	xa_for_each(&box->fd_list, idx, box_fd) {
> > +		file = box_fd->file;
> > +
> > +		if (!file->f_fdbox_op->kho_write) {
> > +			pr_info("box '%s' FD '%s' has no KHO method. It won't be saved across kexec\n",
> > +				box->name, box_fd->name);
> > +			continue;
> > +		}
> > +
> > +		err = fdt_begin_node(fdt, box_fd->name);
> > +		if (err) {
> > +			pr_err("failed to begin node for box '%s' FD '%s'\n",
> > +			       box->name, box_fd->name);
> > +			return err;
> > +		}
> > +
> > +		inode_lock(file_inode(file));
> > +		err = file->f_fdbox_op->kho_write(box_fd, fdt);
> > +		inode_unlock(file_inode(file));
> > +		if (err) {
> > +			pr_err("kho_write failed for box '%s' FD '%s': %d\n",
> > +			       box->name, box_fd->name, err);
> > +			return err;
> > +		}
> > +
> > +		err = fdt_end_node(fdt);
> > +		if (err) {
> > +			/* TODO: This leaks all pages reserved by kho_write(). */
> > +			pr_err("failed to end node for box '%s' FD '%s'\n",
> > +			       box->name, box_fd->name);
> > +			return err;
> > +		}
> > +	}
> > +
> > +	return err;
> > +}
> > +
> > +static int fdbox_kho_write_boxes(void *fdt)
> > +{
> > +	static const char compatible[] = "fdbox,box-v1";
> > +	struct fdbox *box;
> > +	unsigned long idx;
> > +	int err = 0;
> > +
> > +	xa_for_each(&priv.box_list, idx, box) {
> > +		if (!box->sealed)
> > +			continue;
> > +
> > +		err |= fdt_begin_node(fdt, box->name);
> > +		err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
> > +		err |= fdbox_kho_write_fds(fdt, box);
> > +		err |= fdt_end_node(fdt);
> > +	}
> > +
> > +	return err;
> > +}
> > +
> > +static int fdbox_kho_notifier(struct notifier_block *self,
> > +			      unsigned long cmd,
> > +			      void *v)
> > +{
> > +	static const char compatible[] = "fdbox-v1";
> > +	void *fdt = v;
> > +	int err = 0;
> > +
> > +	switch (cmd) {
> > +	case KEXEC_KHO_ABORT:
> > +		return NOTIFY_DONE;
> > +	case KEXEC_KHO_DUMP:
> > +		/* Handled below */
> > +		break;
> > +	default:
> > +		return NOTIFY_BAD;
> > +	}
> > +
> > +	err |= fdt_begin_node(fdt, "fdbox");
> > +	err |= fdt_property(fdt, "compatible", compatible, sizeof(compatible));
> > +	err |= fdbox_kho_write_boxes(fdt);
> > +	err |= fdt_end_node(fdt);
> > +
> > +	return err ? NOTIFY_BAD : NOTIFY_DONE;
> > +}
> > +
> > +static struct notifier_block fdbox_kho_nb = {
> > +	.notifier_call = fdbox_kho_notifier,
> > +};
> > +
> > +static void fdbox_recover_fd(const void *fdt, int offset, struct fdbox *box,
> > +			     struct file *(*fn)(const void *fdt, int offset))
> > +{
> > +	struct fdbox_fd *box_fd;
> > +	struct file *file;
> > +	const char *name;
> > +
> > +	name = fdt_get_name(fdt, offset, NULL);
> > +	if (!name) {
> > +		pr_err("no name in FDT for FD at offset %d\n", offset);
> > +		return;
> > +	}
> > +
> > +	file = fn(fdt, offset);
> > +	if (!file)
> > +		return;

Any userspace that does rely on an inode number or other specific inode
information will be confused by the recovered file as there's zero
guarantee you get the same inode number back.

If userspace relies on specific ownership or mode information for the
inode you're not restoring it.

Any associated struct file has credentials attached to it via that
aren't restored nor is it reasonably restorable. The same goes for LSM
information associated with that file. Cgroup information is also lost.

Say you have userspace processes that have dup()ed file descriptors to
the same struct file and both files are put into the fdstore by
different names you'll end up serializing the same underlying inode
twice. Then on recovery, you restore two different files meaning that
two userspace processes that thought they share the same file won't
anymore.

Either you enlighten all relevant userspace to go to the fdbox to
recover their files so that the credentials match or nothing really
matches before and after recovery.

> > +
> > +	scoped_guard(rwsem_read, &box->rwsem) {
> > +		box_fd = fdbox_put_file(box, name, file);
> > +		if (IS_ERR(box_fd)) {
> > +			pr_err("failed to put fd '%s' into box '%s': %ld\n",
> > +			       box->name, name, PTR_ERR(box_fd));
> > +			fput(file);
> > +			return;
> > +		}
> > +	}
> > +}
> > +
> > +static void fdbox_kho_recover(void)
> > +{
> > +	const void *fdt = kho_get_fdt();
> > +	const char *path = "/fdbox";
> > +	int off, box, fd;
> > +	int err;
> > +
> > +	/* Not a KHO boot */
> > +	if (!fdt)
> > +		return;
> > +
> > +	/*
> > +	 * When adding handlers this is taken as read. Taking it as write here
> > +	 * ensures no handlers get added while nodes are being processed,
> > +	 * eliminating the race of a handler getting added after its node is
> > +	 * processed, but before the whole recover is done.
> > +	 */
> > +	guard(rwsem_write)(&priv.recover_sem);
> > +
> > +	off = fdt_path_offset(fdt, path);
> > +	if (off < 0) {
> > +		pr_debug("could not find '%s' in DT", path);
> > +		return;
> > +	}
> > +
> > +	err = fdt_node_check_compatible(fdt, off, "fdbox-v1");
> > +	if (err) {
> > +		pr_err("invalid top level compatible\n");
> > +		return;
> > +	}
> > +
> > +	fdt_for_each_subnode(box, fdt, off) {
> > +		struct fdbox *new_box;
> > +
> > +		err = fdt_node_check_compatible(fdt, box, "fdbox,box-v1");
> > +		if (err) {
> > +			pr_err("invalid compatible for box '%s'\n",
> > +			       fdt_get_name(fdt, box, NULL));
> > +			continue;
> > +		}
> > +
> > +		new_box = _fdbox_create_box(fdt_get_name(fdt, box, NULL));
> > +		if (IS_ERR(new_box)) {
> > +			pr_warn("could not create box '%s'\n",
> > +				fdt_get_name(fdt, box, NULL));
> > +			continue;
> > +		}
> > +
> > +		fdt_for_each_subnode(fd, fdt, box) {
> > +			struct fdbox_handler *handler;
> > +			const char *compatible;
> > +			unsigned long idx;
> > +
> > +			compatible = fdt_getprop(fdt, fd, "compatible", NULL);
> > +			if (!compatible) {
> > +				pr_warn("failed to get compatible for FD '%s'. Skipping.\n",
> > +					fdt_get_name(fdt, fd, NULL));
> > +				continue;
> > +			}
> > +
> > +			xa_for_each(&priv.handlers, idx, handler) {
> > +				if (!strcmp(handler->compatible, compatible))
> > +					break;
> > +			}
> > +
> > +			if (handler) {
> > +				fdbox_recover_fd(fdt, fd, new_box, handler->fn);
> > +			} else {
> > +				u32 id;
> > +
> > +				pr_debug("found no handler for compatible %s. Queueing for later.\n",
> > +					 compatible);
> > +
> > +				if (xa_alloc(&new_box->pending_fds, &id,
> > +					     xa_mk_value(fd), xa_limit_32b,
> > +					     GFP_KERNEL)) {
> > +					pr_warn("failed to queue pending FD '%s' to list\n",
> > +						fdt_get_name(fdt, fd, NULL));
> > +				}
> > +			}
> > +		}
> > +
> > +		new_box->sealed = true;
> > +	}
> > +
> > +	priv.recover_done = true;
> > +}
> > +
> > +static void fdbox_recover_pending(struct fdbox_handler *handler)
> > +{
> > +	const void *fdt = kho_get_fdt();
> > +	unsigned long bid, pid;
> > +	struct fdbox *box;
> > +	void *pending;
> > +
> > +	if (WARN_ON(!fdt))
> > +		return;
> > +
> > +	xa_for_each(&priv.box_list, bid, box) {
> > +		xa_for_each(&box->pending_fds, pid, pending) {
> > +			int off = xa_to_value(pending);
> > +
> > +			if (fdt_node_check_compatible(fdt, off, handler->compatible) == 0) {
> > +				fdbox_recover_fd(fdt, off, box, handler->fn);
> > +				xa_erase(&box->pending_fds, pid);
> > +			}
> > +		}
> > +	}
> > +}
> > +
> > +int fdbox_register_handler(const char *compatible,
> > +			   struct file *(*fn)(const void *fdt, int offset))
> > +{
> > +	struct xarray *handlers = &priv.handlers;
> > +	struct fdbox_handler *handler, *cmp;
> > +	unsigned long idx;
> > +	int ret;
> > +	u32 id;
> > +
> > +	/* See comment in fdbox_kho_recover(). */
> > +	guard(rwsem_read)(&priv.recover_sem);
> > +
> > +	handler = kmalloc(sizeof(*handler), GFP_KERNEL);
> > +	if (!handler)
> > +		return -ENOMEM;
> > +
> > +	handler->compatible = compatible;
> > +	handler->fn = fn;
> > +
> > +	xa_lock(handlers);
> > +	xa_for_each(handlers, idx, cmp) {
> > +		if (!strcmp(cmp->compatible, compatible)) {
> > +			xa_unlock(handlers);
> > +			kfree(handler);
> > +			return -EEXIST;
> > +		}
> > +	}
> > +
> > +	ret = __xa_alloc(handlers, &id, handler, xa_limit_32b, GFP_KERNEL);
> > +	xa_unlock(handlers);
> > +	if (ret) {
> > +		kfree(handler);
> > +		return ret;
> > +	}
> > +
> > +	if (priv.recover_done)
> > +		fdbox_recover_pending(handler);
> > +
> > +	return 0;
> > +}
> > +
> > +static int __init fdbox_init(void)
> > +{
> > +	int ret = 0;
> > +
> > +	/* /dev/fdbox/$NAME */
> > +	priv.class = class_create("fdbox");
> > +	if (IS_ERR(priv.class))
> > +		return PTR_ERR(priv.class);
> > +
> > +	priv.class->devnode = fdbox_devnode;
> > +
> > +	ret = alloc_chrdev_region(&priv.box_devt, 0, 1, "fdbox");
> > +	if (ret)
> > +		goto err_class;
> > +
> > +	ret = misc_register(&fdbox_dev);
> > +	if (ret) {
> > +		pr_err("fdbox: misc device register failed\n");
> > +		goto err_chrdev;
> > +	}
> > +
> > +	if (IS_ENABLED(CONFIG_KEXEC_HANDOVER)) {
> > +		register_kho_notifier(&fdbox_kho_nb);
> > +		fdbox_kho_recover();
> > +	}
> > +
> > +	return 0;
> > +
> > +err_chrdev:
> > +	unregister_chrdev_region(priv.box_devt, 1);
> > +	priv.box_devt = 0;
> > +err_class:
> > +	class_destroy(priv.class);
> > +	priv.class = NULL;
> > +	return ret;
> > +}
> > +module_init(fdbox_init);
> > diff --git a/include/linux/fdbox.h b/include/linux/fdbox.h
> > new file mode 100644
> > index 0000000000000..0bc18742940f5
> > --- /dev/null
> > +++ b/include/linux/fdbox.h
> > @@ -0,0 +1,119 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * Copyright (C) 2024-2025 Amazon.com Inc. or its affiliates.
> > + *
> > + * Author: Pratyush Yadav <ptyadav@amazon.de>
> > + * Author: Alexander Graf <graf@amazon.com>
> > + */
> > +#ifndef _LINUX_FDBOX_H
> > +#define _LINUX_FDBOX_H
> > +
> > +#include <linux/cdev.h>
> > +#include <linux/device.h>
> > +#include <linux/fs.h>
> > +#include <linux/list.h>
> > +#include <linux/mutex.h>
> > +#include <linux/types.h>
> > +#include <uapi/linux/fdbox.h>
> > +
> > +/**
> > + * struct fdbox - A box of FDs.
> > + * @name: Name of the box. Must be unique.
> > + * @rwsem: Used to ensure exclusive access to the box during SEAL/UNSEAL
> > + *         operations.
> > + * @dev: Backing device for the character device.
> > + * @cdev: Character device which accepts ioctls from userspace.
> > + * @fd_list: List of FDs in the box.
> > + * @sealed: Whether the box is sealed or not.
> > + */
> > +struct fdbox {
> > +	char				name[FDBOX_NAME_LEN];
> > +	/*
> > +	 * Taken as read when non-exclusive access is needed and the box can be
> > +	 * in mutable state. For example, the GET_FD and PUT_FD operations use
> > +	 * it as read when adding or removing FDs from the box.
> > +	 *
> > +	 * Taken as write when exclusive access is needed and the box should be
> > +	 * in a stable, non-mutable state. For example, the SEAL and UNSEAL
> > +	 * operations use it as write because they need the list of FDs to be
> > +	 * stable.
> > +	 */
> > +	struct rw_semaphore		rwsem;
> > +	struct device			dev;
> > +	struct cdev			cdev;
> > +	struct xarray			fd_list;
> > +	struct xarray			pending_fds;
> > +	bool				sealed;
> > +};
> > +
> > +/**
> > + * struct fdbox_fd - An FD in a box.
> > + * @name: Name of the FD. Must be unique in the box.
> > + * @file: Underlying file for the FD.
> > + * @flags: Box flags. Currently, no flags are allowed.
> > + * @box: The box to which this FD belongs.
> > + */
> > +struct fdbox_fd {
> > +	char				name[FDBOX_NAME_LEN];
> > +	struct file			*file;
> > +	int				flags;
> > +	struct fdbox			*box;
> > +};
> > +
> > +/**
> > + * struct fdbox_file_ops - operations for files that can be put into a fdbox.
> > + */
> > +struct fdbox_file_ops {
> > +	/**
> > +	 * @kho_write: write fd to KHO FDT.
> > +	 *
> > +	 * box_fd: Box FD to be serialized.
> > +	 *
> > +	 * fdt: KHO FDT
> > +	 *
> > +	 * This is called during KHO activation phase to serialize all data
> > +	 * needed for a FD to be preserved across a KHO.
> > +	 *
> > +	 * Returns: 0 on success, -errno on failure. Error here causes KHO
> > +	 * activation failure.
> > +	 */
> > +	int (*kho_write)(struct fdbox_fd *box_fd, void *fdt);
> > +	/**
> > +	 * @seal: seal the box
> > +	 *
> > +	 * box: Box which is going to be sealed.
> > +	 *
> > +	 * This can be set if a file has a dependency on other files. At seal
> > +	 * time, all the FDs in the box can be inspected to ensure all the
> > +	 * dependencies are met.
> > +	 */
> > +	int (*seal)(struct fdbox *box);
> > +	/**
> > +	 * @unseal: unseal the box
> > +	 *
> > +	 * box: Box which is going to be sealed.
> > +	 *
> > +	 * The opposite of seal. This can be set if a file has a dependency on
> > +	 * other files. At unseal time, all the FDs in the box can be inspected
> > +	 * to ensure all the dependencies are met. This can help ensure all
> > +	 * necessary FDs made it through after a KHO for example.
> > +	 */
> > +	int (*unseal)(struct fdbox *box);
> > +};
> > +
> > +/**
> > + * fdbox_register_handler - register a handler for recovering Box FDs after KHO.
> > + * @compatible: compatible string in the KHO FDT node.
> > + * @handler: function to parse the FDT at offset 'offset'.
> > + *
> > + * After KHO, the FDs in the KHO FDT must be deserialized by the underlying
> > + * modules or file systems. Since module initialization can be in any order,
> > + * including after FDBox has been initialized, handler registration allows
> > + * modules to queue their parsing functions, and FDBox will execute them when it
> > + * can.
> > + *
> > + * Returns: 0 on success, -errno otherwise.
> > + */
> > +int fdbox_register_handler(const char *compatible,
> > +			   struct file *(*handler)(const void *fdt, int offset));
> > +#endif /* _LINUX_FDBOX_H */
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index be3ad155ec9f7..7d710a5e09b5b 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -81,6 +81,9 @@ struct fs_context;
> >  struct fs_parameter_spec;
> >  struct fileattr;
> >  struct iomap_ops;
> > +struct fdbox;
> > +struct fdbox_fd;
> > +struct fdbox_file_ops;
> >  
> >  extern void __init inode_init(void);
> >  extern void __init inode_init_early(void);
> > @@ -1078,6 +1081,7 @@ static inline int ra_has_index(struct file_ra_state *ra, pgoff_t index)
> >   * @f_llist: work queue entrypoint
> >   * @f_ra: file's readahead state
> >   * @f_freeptr: Pointer used by SLAB_TYPESAFE_BY_RCU file cache (don't touch.)
> > + * @f_fdbox_op: FDBOX operations
> >   */
> >  struct file {
> >  	file_ref_t			f_ref;
> > @@ -1116,6 +1120,9 @@ struct file {
> >  		freeptr_t		f_freeptr;
> >  	};
> >  	/* --- cacheline 3 boundary (192 bytes) --- */
> > +#ifdef CONFIG_FDBOX
> > +	const struct fdbox_file_ops	*f_fdbox_op;
> > +#endif
> 
> I've shrunk struct file significantly over the last kernel releases
> removing useless member and gave it a new refcount mechanism to increase
> performance.
> 
> This is now taking 8 free bytes for this a very special-purpose thing.
> Those 8 bytes should be used for stuff that is widely useful. So that
> already makes me go "no-way".
> 
> struct file shouldn't have to know about any of this stuff. Just
> recognize the type of file by its struct file_operations or a lot of
> other ways and then infer the fdbox ops from that.
> 
> >  } __randomize_layout
> >    __attribute__((aligned(4)));	/* lest something weird decides that 2 is OK */
> >  
> > diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
> > index 69e110c2b86a9..fedb873c04453 100644
> > --- a/include/linux/miscdevice.h
> > +++ b/include/linux/miscdevice.h
> > @@ -71,6 +71,7 @@
> >  #define USERIO_MINOR		240
> >  #define VHOST_VSOCK_MINOR	241
> >  #define RFKILL_MINOR		242
> > +#define FDBOX_MINOR		243
> >  #define MISC_DYNAMIC_MINOR	255
> >  
> >  struct device;
> > diff --git a/include/uapi/linux/fdbox.h b/include/uapi/linux/fdbox.h
> > new file mode 100644
> > index 0000000000000..577ba33b908fd
> > --- /dev/null
> > +++ b/include/uapi/linux/fdbox.h
> > @@ -0,0 +1,61 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +/*
> > + * This file contains definitions and structures for fdbox ioctls.
> > + *
> > + * Copyright (C) 2024-2025 Amazon.com Inc. or its affiliates.
> > + *
> > + * Author: Pratyush Yadav <ptyadav@amazon.de>
> > + * Author: Alexander Graf <graf@amazon.com>
> > + */
> > +#ifndef _UAPI_LINUX_FDBOX_H
> > +#define _UAPI_LINUX_FDBOX_H
> > +
> > +#include <linux/types.h>
> > +#include <linux/ioctl.h>
> > +
> > +#define FDBOX_NAME_LEN			256
> > +
> > +#define FDBOX_TYPE	('.')
> > +#define FDBOX_BASE	0
> > +
> > +/* Ioctls on /dev/fdbox/fdbox */
> > +
> > +/* Create a box. */
> > +#define FDBOX_CREATE_BOX	_IO(FDBOX_TYPE, FDBOX_BASE + 0)
> > +struct fdbox_create_box {
> > +	__u64 flags;
> > +	__u8 name[FDBOX_NAME_LEN];
> > +};
> > +
> > +/* Delete a box. */
> > +#define FDBOX_DELETE_BOX	_IO(FDBOX_TYPE, FDBOX_BASE + 1)
> > +struct fdbox_delete_box {
> > +	__u64 flags;
> > +	__u8 name[FDBOX_NAME_LEN];
> > +};
> > +
> > +/* Ioctls on /dev/fdbox/$BOXNAME */
> > +
> > +/* Put FD into box. This unmaps the FD from the calling process. */
> > +#define FDBOX_PUT_FD	_IO(FDBOX_TYPE, FDBOX_BASE + 2)
> > +struct fdbox_put_fd {
> > +	__u64 flags;
> > +	__u32 fd;
> > +	__u32 pad;
> > +	__u8 name[FDBOX_NAME_LEN];
> > +};
> > +
> > +/* Get the FD from box. This maps the FD into the calling process. */
> > +#define FDBOX_GET_FD	_IO(FDBOX_TYPE, FDBOX_BASE + 3)
> > +struct fdbox_get_fd {
> > +	__u64 flags;
> > +	__u32 pad;
> > +	__u8 name[FDBOX_NAME_LEN];
> > +};
> > +
> > +/* Seal the box. After this, no FDs can be put in or taken out of the box. */
> > +#define FDBOX_SEAL	_IO(FDBOX_TYPE, FDBOX_BASE + 4)
> > +/* Unseal the box. Opposite of seal. */
> > +#define FDBOX_UNSEAL	_IO(FDBOX_TYPE, FDBOX_BASE + 5)
> > +
> > +#endif /* _UAPI_LINUX_FDBOX_H */
> > -- 
> > 2.47.1
> > 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 2/5] misc: add documentation for FDBox
  2025-03-07  0:57 ` [RFC PATCH 2/5] misc: add documentation for FDBox Pratyush Yadav
  2025-03-07  2:19   ` Randy Dunlap
@ 2025-03-07 14:22   ` Jonathan Corbet
  2025-03-07 14:51     ` Pratyush Yadav
  1 sibling, 1 reply; 29+ messages in thread
From: Jonathan Corbet @ 2025-03-07 14:22 UTC (permalink / raw)
  To: Pratyush Yadav, linux-kernel
  Cc: Pratyush Yadav, Eric Biederman, Arnd Bergmann, Greg Kroah-Hartman,
	Alexander Viro, Christian Brauner, Jan Kara, Hugh Dickins,
	Alexander Graf, Benjamin Herrenschmidt, David Woodhouse,
	James Gowans, Mike Rapoport, Paolo Bonzini, Pasha Tatashin,
	Anthony Yznaga, Dave Hansen, David Hildenbrand, Jason Gunthorpe,
	Matthew Wilcox, Wei Yang, Andrew Morton, linux-fsdevel, linux-doc,
	linux-mm, kexec

Pratyush Yadav <ptyadav@amazon.de> writes:

> With FDBox in place, add documentation that describes what it is and how
> it is used, along with its UAPI and in-kernel API.
>
> Since the document refers to KHO, add a reference tag in kho/index.rst.
>
> Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
> ---
>  Documentation/filesystems/locking.rst |  21 +++
>  Documentation/kho/fdbox.rst           | 224 ++++++++++++++++++++++++++
>  Documentation/kho/index.rst           |   3 +
>  MAINTAINERS                           |   1 +
>  4 files changed, 249 insertions(+)
>  create mode 100644 Documentation/kho/fdbox.rst

Please do not create a new top-level directory under Documentation for
this; your new file belongs in userspace-api/.

From a quick perusal of your documentation:

- You never say what "KHO" is

- Your boxes live in a single global namespace?

- What sort of access control applies to one of these boxes?  What keeps
  me from mucking around inside somebody else's box?

Thanks,

jon

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 2/5] misc: add documentation for FDBox
  2025-03-07 14:22   ` Jonathan Corbet
@ 2025-03-07 14:51     ` Pratyush Yadav
  2025-03-07 15:25       ` Jonathan Corbet
  0 siblings, 1 reply; 29+ messages in thread
From: Pratyush Yadav @ 2025-03-07 14:51 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: linux-kernel, Eric Biederman, Arnd Bergmann, Greg Kroah-Hartman,
	Alexander Viro, Christian Brauner, Jan Kara, Hugh Dickins,
	Alexander Graf, Benjamin Herrenschmidt, David Woodhouse,
	James Gowans, Mike Rapoport, Paolo Bonzini, Pasha Tatashin,
	Anthony Yznaga, Dave Hansen, David Hildenbrand, Jason Gunthorpe,
	Matthew Wilcox, Wei Yang, Andrew Morton, linux-fsdevel, linux-doc,
	linux-mm, kexec

On Fri, Mar 07 2025, Jonathan Corbet wrote:

> Pratyush Yadav <ptyadav@amazon.de> writes:
>
>> With FDBox in place, add documentation that describes what it is and how
>> it is used, along with its UAPI and in-kernel API.
>>
>> Since the document refers to KHO, add a reference tag in kho/index.rst.
>>
>> Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
>> ---
>>  Documentation/filesystems/locking.rst |  21 +++
>>  Documentation/kho/fdbox.rst           | 224 ++++++++++++++++++++++++++
>>  Documentation/kho/index.rst           |   3 +
>>  MAINTAINERS                           |   1 +
>>  4 files changed, 249 insertions(+)
>>  create mode 100644 Documentation/kho/fdbox.rst
>
> Please do not create a new top-level directory under Documentation for
> this; your new file belongs in userspace-api/.

I did not. The top-level directory comes from the KHO patches [0] (not
merged yet). This series is based on top of those. You can find the full
tree here [1].

Since this is closely tied to KHO I found it a good fit for putting it
on KHO's directory. I don't have strong opinions about this so don't
mind moving it elsewhere if you think that is better.

>
> From a quick perusal of your documentation:
>
> - You never say what "KHO" is

fdbox.rst has a reference to Documentation/kho/index.rst which does
explain what Kexec Handover (KHO) means. Due to the ref to the top-level
heading, the rendered text looks like:

>     The primary purpose of FDBox is to be used with Kexec Handover Subsystem.
                    This is a link to kho/index.rst   ^^^^^^^^^^^^^^^^^^^^^^^^

IMO that is enough explanation, and there would be little benefit in
duplicating the explanation for KHO. Do you still think a one or two
line explanation is warranted here?

>
> - Your boxes live in a single global namespace?

Yes. Should they not? FWIW, the boxes are in a global namespace, but
each box has a namespace of its own for naming FDs. All FD names in a
single box should be unique but the same FD name can be used in two
different boxes.

>
> - What sort of access control applies to one of these boxes?  What keeps
>   me from mucking around inside somebody else's box?

For now, none. You need CAP_SYS_ADMIN to be able to muck around with a
box. The current idea is that we only let root use it and if more a fine
grained permission model needed it can be done in userspace, or we can
extend our permission model later.

[0] https://lore.kernel.org/lkml/20250206132754.2596694-10-rppt@kernel.org/
[1] https://web.git.kernel.org/pub/scm/linux/kernel/git/pratyush/linux.git/tree/Documentation/kho?h=kho

-- 
Regards,
Pratyush Yadav

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 2/5] misc: add documentation for FDBox
  2025-03-07  2:19   ` Randy Dunlap
@ 2025-03-07 15:03     ` Pratyush Yadav
  0 siblings, 0 replies; 29+ messages in thread
From: Pratyush Yadav @ 2025-03-07 15:03 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: linux-kernel, Jonathan Corbet, Eric Biederman, Arnd Bergmann,
	Greg Kroah-Hartman, Alexander Viro, Christian Brauner, Jan Kara,
	Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt,
	David Woodhouse, James Gowans, Mike Rapoport, Paolo Bonzini,
	Pasha Tatashin, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Jason Gunthorpe, Matthew Wilcox, Wei Yang, Andrew Morton,
	linux-fsdevel, linux-doc, linux-mm, kexec

On Thu, Mar 06 2025, Randy Dunlap wrote:

> On March 6, 2025 4:57:36 PM PST, Pratyush Yadav <ptyadav@amazon.de> wrote:
>>With FDBox in place, add documentation that describes what it is and how
>>it is used, along with its UAPI and in-kernel API.
>>
>>Since the document refers to KHO, add a reference tag in kho/index.rst.
>>
>>Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
>>---
[...]
>>+
>>+The File Descriptor Box (FDBox) is a mechanism for userspace to name file
>>+descriptors and give them over to the kernel to hold. They can later be
>>+retrieved by passing in the same name.
>>+
>>+The primary purpose of FDBox is to be used with :ref:`kho`. There are many kinds
>
>     many kinds of 
>
>>+anonymous file descriptors in the kernel like memfd, guest_memfd, iommufd, etc.
>
>    etc.,

Thanks, will fix these.

[...]
>>+
>>+Box
>>+---
>>+
>>+The box is a container for FDs. Boxes are identified by their name, which must
>>+be unique. Userspace can put FDs in the box using the ``FDBOX_PUT_FD``
>>+operation, and take them out of the box using the ``FDBOX_GET_FD`` operation.
>
> Is this ioctl range documented is ioctl-number.rst?
> I didn't notice a patch for that.

My bad, missed that.

>
>>+Once all the required FDs are put into the box, it can be sealed to make it
>>+ready for shipping. This can be done by the ``FDBOX_SEAL`` operation. The seal
>>+operation notifies each FD in the box. If any of the FDs have a dependency on
>>+another, this gives them an opportunity to ensure all dependencies are met, or
>>+fail the seal if not. Once a box is sealed, no FDs can be added or removed from
>>+the box until it is unsealed. Only sealed boxes are transported to a new kernel
>
> What if KHO is not being used?

Then the FDs are lost on shutdown.

>
>>+via KHO. The box can be unsealed by the ``FDBOX_UNSEAL`` operation. This is the
>>+opposite of seal. It also notifies each FD in the box to ensure all dependencies
>>+are met. This can be useful in case some FDs fail to be restored after KHO.
>>+
>>+Box FD
>>+------
>
> I can't tell in my email font, but is each underlinoat least as long as the title above it?

They are. I went and double-checked as well. Maybe just something with
your email font.

[...]

-- 
Regards,
Pratyush Yadav

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 1/5] misc: introduce FDBox
  2025-03-07  9:31   ` Christian Brauner
  2025-03-07 13:19     ` Christian Brauner
@ 2025-03-07 15:14     ` Jason Gunthorpe
  2025-03-08 11:09       ` Christian Brauner
  2025-03-08  0:10     ` Pratyush Yadav
  2 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2025-03-07 15:14 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Pratyush Yadav, Linus Torvalds, linux-kernel, Jonathan Corbet,
	Eric Biederman, Arnd Bergmann, Greg Kroah-Hartman, Alexander Viro,
	Jan Kara, Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt,
	David Woodhouse, James Gowans, Mike Rapoport, Paolo Bonzini,
	Pasha Tatashin, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Matthew Wilcox, Wei Yang, Andrew Morton, linux-fsdevel, linux-doc,
	linux-mm, kexec

On Fri, Mar 07, 2025 at 10:31:39AM +0100, Christian Brauner wrote:
> On Fri, Mar 07, 2025 at 12:57:35AM +0000, Pratyush Yadav wrote:
> > The File Descriptor Box (FDBox) is a mechanism for userspace to name
> > file descriptors and give them over to the kernel to hold. They can
> > later be retrieved by passing in the same name.
> > 
> > The primary purpose of FDBox is to be used with Kexec Handover (KHO).
> > There are many kinds anonymous file descriptors in the kernel like
> > memfd, guest_memfd, iommufd, etc. that would be useful to be preserved
> > using KHO. To be able to do that, there needs to be a mechanism to label
> > FDs that allows userspace to set the label before doing KHO and to use
> > the label to map them back after KHO. FDBox achieves that purpose by
> > exposing a miscdevice which exposes ioctls to label and transfer FDs
> > between the kernel and userspace. FDBox is not intended to work with any
> > generic file descriptor. Support for each kind of FDs must be explicitly
> > enabled.
> 
> This makes no sense as a generic concept. If you want to restore shmem
> and possibly anonymous inodes files via KHO then tailor the solution to
> shmem and anon inodes but don't make this generic infrastructure. This
> has zero chances to cover generic files.

We need it to cover a range of FD types in the kernel like iommufd and
vfio.

It is not "generic" in the sense every FD in the kernel magicaly works
with fdbox, but that any driver/subsystem providing a FD could be
enlightened to support it.

Very much do not want the infrastructure tied to just shmem and memfd.

> As soon as you're dealing with non-kernel internal mounts that are not
> guaranteed to always be there or something that depends on superblock or
> mount specific information that can change you're already screwed.

This is really targetting at anonymous or character device file
descriptors that don't have issues with mounts.

Same remark about inode permissions and what not. The successor
kernel would be responsible to secure the FDBOX and when it takes
anything out it has to relabel it if required.

inode #s and things can change because this is not something like CRIU
that would have state linked to inode numbers. The applications in the
sucessor kernels are already very special, they will need to cope with
inode number changes along with all the other special stuff they do.

> And struct file should have zero to do with this KHO stuff. It doesn't
> need to carry new operations and it doesn't need to waste precious space
> for any of this.

Yeah, it should go through file_operations in some way.

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 2/5] misc: add documentation for FDBox
  2025-03-07 14:51     ` Pratyush Yadav
@ 2025-03-07 15:25       ` Jonathan Corbet
  2025-03-07 23:28         ` Pratyush Yadav
  0 siblings, 1 reply; 29+ messages in thread
From: Jonathan Corbet @ 2025-03-07 15:25 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: linux-kernel, Eric Biederman, Arnd Bergmann, Greg Kroah-Hartman,
	Alexander Viro, Christian Brauner, Jan Kara, Hugh Dickins,
	Alexander Graf, Benjamin Herrenschmidt, David Woodhouse,
	James Gowans, Mike Rapoport, Paolo Bonzini, Pasha Tatashin,
	Anthony Yznaga, Dave Hansen, David Hildenbrand, Jason Gunthorpe,
	Matthew Wilcox, Wei Yang, Andrew Morton, linux-fsdevel, linux-doc,
	linux-mm, kexec

Pratyush Yadav <ptyadav@amazon.de> writes:

> On Fri, Mar 07 2025, Jonathan Corbet wrote:
>
>> Pratyush Yadav <ptyadav@amazon.de> writes:
>>
>>> With FDBox in place, add documentation that describes what it is and how
>>> it is used, along with its UAPI and in-kernel API.
>>>
>>> Since the document refers to KHO, add a reference tag in kho/index.rst.
>>>
>>> Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
>>> ---
>>>  Documentation/filesystems/locking.rst |  21 +++
>>>  Documentation/kho/fdbox.rst           | 224 ++++++++++++++++++++++++++
>>>  Documentation/kho/index.rst           |   3 +
>>>  MAINTAINERS                           |   1 +
>>>  4 files changed, 249 insertions(+)
>>>  create mode 100644 Documentation/kho/fdbox.rst
>>
>> Please do not create a new top-level directory under Documentation for
>> this; your new file belongs in userspace-api/.
>
> I did not. The top-level directory comes from the KHO patches [0] (not
> merged yet). This series is based on top of those. You can find the full
> tree here [1].
>
> Since this is closely tied to KHO I found it a good fit for putting it
> on KHO's directory. I don't have strong opinions about this so don't
> mind moving it elsewhere if you think that is better.

I've not seen the KHO stuff, but I suspect I'll grumble about that too.
Our documentation should be organized for its audience, not for its
authors.  So yes, I think that your documentation of the user-space
interface should definitely go in the userspace-api book.

Thanks,

jon

(Who now realizes he has been arguing this point of view for over ten
years ... eventually it will get across... :)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 2/5] misc: add documentation for FDBox
  2025-03-07 15:25       ` Jonathan Corbet
@ 2025-03-07 23:28         ` Pratyush Yadav
  0 siblings, 0 replies; 29+ messages in thread
From: Pratyush Yadav @ 2025-03-07 23:28 UTC (permalink / raw)
  To: Jonathan Corbet
  Cc: linux-kernel, Eric Biederman, Arnd Bergmann, Greg Kroah-Hartman,
	Alexander Viro, Christian Brauner, Jan Kara, Hugh Dickins,
	Alexander Graf, Benjamin Herrenschmidt, David Woodhouse,
	James Gowans, Mike Rapoport, Paolo Bonzini, Pasha Tatashin,
	Anthony Yznaga, Dave Hansen, David Hildenbrand, Jason Gunthorpe,
	Matthew Wilcox, Wei Yang, Andrew Morton, linux-fsdevel, linux-doc,
	linux-mm, kexec

On Fri, Mar 07 2025, Jonathan Corbet wrote:

> Pratyush Yadav <ptyadav@amazon.de> writes:
>
>> On Fri, Mar 07 2025, Jonathan Corbet wrote:
>>
>>> Pratyush Yadav <ptyadav@amazon.de> writes:
>>>
>>>> With FDBox in place, add documentation that describes what it is and how
>>>> it is used, along with its UAPI and in-kernel API.
>>>>
>>>> Since the document refers to KHO, add a reference tag in kho/index.rst.
>>>>
>>>> Signed-off-by: Pratyush Yadav <ptyadav@amazon.de>
>>>> ---
>>>>  Documentation/filesystems/locking.rst |  21 +++
>>>>  Documentation/kho/fdbox.rst           | 224 ++++++++++++++++++++++++++
>>>>  Documentation/kho/index.rst           |   3 +
>>>>  MAINTAINERS                           |   1 +
>>>>  4 files changed, 249 insertions(+)
>>>>  create mode 100644 Documentation/kho/fdbox.rst
>>>
>>> Please do not create a new top-level directory under Documentation for
>>> this; your new file belongs in userspace-api/.
>>
>> I did not. The top-level directory comes from the KHO patches [0] (not
>> merged yet). This series is based on top of those. You can find the full
>> tree here [1].
>>
>> Since this is closely tied to KHO I found it a good fit for putting it
>> on KHO's directory. I don't have strong opinions about this so don't
>> mind moving it elsewhere if you think that is better.
>
> I've not seen the KHO stuff, but I suspect I'll grumble about that too.
> Our documentation should be organized for its audience, not for its
> authors.  So yes, I think that your documentation of the user-space
> interface should definitely go in the userspace-api book.

Okay, fair enough. I'll move it there.

>
> Thanks,
>
> jon
>
> (Who now realizes he has been arguing this point of view for over ten
> years ... eventually it will get across... :)

-- 
Regards,
Pratyush Yadav

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 1/5] misc: introduce FDBox
  2025-03-07  9:31   ` Christian Brauner
  2025-03-07 13:19     ` Christian Brauner
  2025-03-07 15:14     ` Jason Gunthorpe
@ 2025-03-08  0:10     ` Pratyush Yadav
  2025-03-09 12:03       ` Christian Brauner
  2 siblings, 1 reply; 29+ messages in thread
From: Pratyush Yadav @ 2025-03-08  0:10 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Linus Torvalds, linux-kernel, Jonathan Corbet, Eric Biederman,
	Arnd Bergmann, Greg Kroah-Hartman, Alexander Viro, Jan Kara,
	Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt,
	David Woodhouse, James Gowans, Mike Rapoport, Paolo Bonzini,
	Pasha Tatashin, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Jason Gunthorpe, Matthew Wilcox, Wei Yang, Andrew Morton,
	linux-fsdevel, linux-doc, linux-mm, kexec

Hi Christian,

Thanks for the review!

On Fri, Mar 07 2025, Christian Brauner wrote:

> On Fri, Mar 07, 2025 at 12:57:35AM +0000, Pratyush Yadav wrote:
>> The File Descriptor Box (FDBox) is a mechanism for userspace to name
>> file descriptors and give them over to the kernel to hold. They can
>> later be retrieved by passing in the same name.
>> 
>> The primary purpose of FDBox is to be used with Kexec Handover (KHO).
>> There are many kinds anonymous file descriptors in the kernel like
>> memfd, guest_memfd, iommufd, etc. that would be useful to be preserved
>> using KHO. To be able to do that, there needs to be a mechanism to label
>> FDs that allows userspace to set the label before doing KHO and to use
>> the label to map them back after KHO. FDBox achieves that purpose by
>> exposing a miscdevice which exposes ioctls to label and transfer FDs
>> between the kernel and userspace. FDBox is not intended to work with any
>> generic file descriptor. Support for each kind of FDs must be explicitly
>> enabled.
>
> This makes no sense as a generic concept. If you want to restore shmem
> and possibly anonymous inodes files via KHO then tailor the solution to
> shmem and anon inodes but don't make this generic infrastructure. This
> has zero chances to cover generic files.
>
> As soon as you're dealing with non-kernel internal mounts that are not
> guaranteed to always be there or something that depends on superblock or
> mount specific information that can change you're already screwed. This
> will end up a giant mess. This is not supportable or maintainable.

As Jason mentioned, this it _not_ intended to be a generic concept. My
documentation also says that, but perhaps that was not clear enough. It
is supposed to work with only specific type of file descriptors that
explicitly enable support for it in the context of kexec handover. I
think it might be a good idea to have an explicit dependency on KHO so
this distinction is a bit clearer.

It is also not intended to be completely transparent to userspace, where
they magically get their FD back exactly as they put in. I think we
should limit the amount of state we want to guarantee, since it directly
contributes to our ABI exposure to later kernels. The more state we
track, the more complex and inflexible our ABI becomes. So use of this
very much needs an enlightened userspace.

As an example, with memfd, the main purpose of persistence across kexec
is its memory contents. The application needs a way to carry its memory
across, and we provide it a mechanism to do so via FDBox. So we only
guarantee that the memory contents are preserved, along with some small
metadata, instead of the whole inode which is a lot more complex. The
application would then need to be aware of it and expect such changes.

The idea is also _not_ to have all the FDs of userspace into the box.
They should only put in the ones that they specifically need, and
re-open the rest normally. For example, in a live update scenario, the
VMM can put in the guest_memfds, the iommufds, and so on, but then
re-open configuration files or VM metadata via the normal path.

>
> And struct file should have zero to do with this KHO stuff. It doesn't
> need to carry new operations and it doesn't need to waste precious space
> for any of this.

That is a fair point. The main reason I did it this way is because memfd
does not have file_operations of its own. To enable the memfd
abstraction, I wanted the FDBox callback to go into memfd, and it can
pass it down to shmem or hugetlbfs. Having the pointer in struct file
makes that easy.

I think I can find ways to make it work via file_operations, so I will
do that in the next version.

>
>> 
>> While the primary purpose of FDBox is to be used with KHO, it does not
>> explicitly require CONFIG_KEXEC_HANDOVER, since it can be used without
>> KHO, simply as a way to preserve or transfer FDs when userspace exits.
>
> This use-case is covered with systemd's fdstore and it's available to
> unprivileged userspace. Stashing arbitrary file descriptors in the
> kernel in this way isn't a good idea.

For one, it can't be arbitrary FDs, but only explicitly enabled ones.
Beyond that, while not intended, there is no way to stop userspace from
using it as a stash. Stashing FDs is a needed operation for this to
work, and there is no way to guarantee in advance that userspace will
actually use it for KHO, and not just stash it to grab back later.

I think at least having an explicit dependency on CONFIG_KEXEC_HANDOVER
can help a bit at least.

[...]
>> +
>> +	ret = close_fd(put_fd.fd);
>> +	if (ret) {
>> +		struct fdbox_fd *del;
>> +
>> +		del = fdbox_remove_fd(box, put_fd.name);
>> +		/*
>> +		 * If we fail to remove from list, it means someone else took
>> +		 * the FD out. In that case, they own the refcount of the file
>> +		 * now.
>> +		 */
>> +		if (del == box_fd)
>> +			fput(file);
>
> This is a racy mess. Why would adding a file to an fdbox be coupled with
> closing it concpetually? The caller should close the file descriptor
> itself and not do this close_fd() here in the kernel.

Makes sense. We can make it a requirement to have all open FDs of the
file be closed by userspace before the box can be sealed.

>
>> +
>> +		return ret;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
[...]
>> +static long box_fops_unl_ioctl(struct file *filep,
>> +			       unsigned int cmd, unsigned long arg)
>> +{
>> +	struct fdbox *box = filep->private_data;
>> +	long ret = -EINVAL;
>> +
>> +	if (!capable(CAP_SYS_ADMIN))
>> +		return -EPERM;
>> +
>> +	switch (cmd) {
>> +	case FDBOX_PUT_FD:
>> +		ret = fdbox_put_fd(box, arg);
>> +		break;
>> +	case FDBOX_UNSEAL:
>> +		ret = fdbox_unseal(box);
>> +		break;
>> +	case FDBOX_SEAL:
>> +		ret = fdbox_seal(box);
>> +		break;
>> +	case FDBOX_GET_FD:
>> +		ret = fdbox_get_fd(box, arg);
>> +		break;
>
> How does userspace know what file descriptors are in this fdbox if only
> put and get are present? Userspace just remembers the names and
> otherwise it simply leaks files that no one remembered?

For now, it is supposed to remember that, but having a FDBOX_LIST
operation should be simple enough. Will add that in the next revision.

Also, if userspace suspects it forgot some, it can always delete the box
to clean up the leftover ones.

>
>> +	default:
>> +		ret = -EINVAL;
>> +		break;
>> +	}
>> +
>> +	return ret;
>> +}
>> +
[...]

-- 
Regards,
Pratyush Yadav

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 1/5] misc: introduce FDBox
  2025-03-07 15:14     ` Jason Gunthorpe
@ 2025-03-08 11:09       ` Christian Brauner
  2025-03-17 16:46         ` Jason Gunthorpe
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2025-03-08 11:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pratyush Yadav, Linus Torvalds, linux-kernel, Jonathan Corbet,
	Eric Biederman, Arnd Bergmann, Greg Kroah-Hartman, Alexander Viro,
	Jan Kara, Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt,
	David Woodhouse, James Gowans, Mike Rapoport, Paolo Bonzini,
	Pasha Tatashin, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Matthew Wilcox, Wei Yang, Andrew Morton, linux-fsdevel, linux-doc,
	linux-mm, kexec

On Fri, Mar 07, 2025 at 11:14:17AM -0400, Jason Gunthorpe wrote:
> On Fri, Mar 07, 2025 at 10:31:39AM +0100, Christian Brauner wrote:
> > On Fri, Mar 07, 2025 at 12:57:35AM +0000, Pratyush Yadav wrote:
> > > The File Descriptor Box (FDBox) is a mechanism for userspace to name
> > > file descriptors and give them over to the kernel to hold. They can
> > > later be retrieved by passing in the same name.
> > > 
> > > The primary purpose of FDBox is to be used with Kexec Handover (KHO).
> > > There are many kinds anonymous file descriptors in the kernel like
> > > memfd, guest_memfd, iommufd, etc. that would be useful to be preserved
> > > using KHO. To be able to do that, there needs to be a mechanism to label
> > > FDs that allows userspace to set the label before doing KHO and to use
> > > the label to map them back after KHO. FDBox achieves that purpose by
> > > exposing a miscdevice which exposes ioctls to label and transfer FDs
> > > between the kernel and userspace. FDBox is not intended to work with any
> > > generic file descriptor. Support for each kind of FDs must be explicitly
> > > enabled.
> > 
> > This makes no sense as a generic concept. If you want to restore shmem
> > and possibly anonymous inodes files via KHO then tailor the solution to
> > shmem and anon inodes but don't make this generic infrastructure. This
> > has zero chances to cover generic files.
> 
> We need it to cover a range of FD types in the kernel like iommufd and

anonymous inode

> vfio.

anonymous inode

> 
> It is not "generic" in the sense every FD in the kernel magicaly works
> with fdbox, but that any driver/subsystem providing a FD could be
> enlightened to support it.
> 
> Very much do not want the infrastructure tied to just shmem and memfd.

Anything you can reasonably want will either be an internal shmem mount,
devtmpfs, or anonymous inodes. Anything else isn't going to work.

> 
> > As soon as you're dealing with non-kernel internal mounts that are not
> > guaranteed to always be there or something that depends on superblock or
> > mount specific information that can change you're already screwed.
> 
> This is really targetting at anonymous or character device file
> descriptors that don't have issues with mounts.
> 
> Same remark about inode permissions and what not. The successor
> kernel would be responsible to secure the FDBOX and when it takes
> anything out it has to relabel it if required.
> 
> inode #s and things can change because this is not something like CRIU
> that would have state linked to inode numbers. The applications in the
> sucessor kernels are already very special, they will need to cope with
> inode number changes along with all the other special stuff they do.
> 
> > And struct file should have zero to do with this KHO stuff. It doesn't
> > need to carry new operations and it doesn't need to waste precious space
> > for any of this.
> 
> Yeah, it should go through file_operations in some way.

I'm fine with a new method. There's not going to be three new methods
just for the sake of this special-purpose thing. And want this to be
part of fs/ and co-maintained by fs people.

I'm not yet sold that this needs to be a character device. Because
that's fundamentally limiting in how useful this can be.

It might be way more useful if this ended up being a separate tiny
filesystem where such preserved files are simply shown as named entries
that you can open instead of ioctl()ing your way through character
devices. But I need to think about that.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 1/5] misc: introduce FDBox
  2025-03-08  0:10     ` Pratyush Yadav
@ 2025-03-09 12:03       ` Christian Brauner
  2025-03-17 16:59         ` Jason Gunthorpe
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2025-03-09 12:03 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Linus Torvalds, linux-kernel, Jonathan Corbet, Eric Biederman,
	Arnd Bergmann, Greg Kroah-Hartman, Alexander Viro, Jan Kara,
	Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt,
	David Woodhouse, James Gowans, Mike Rapoport, Paolo Bonzini,
	Pasha Tatashin, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Jason Gunthorpe, Matthew Wilcox, Wei Yang, Andrew Morton,
	linux-fsdevel, linux-doc, linux-mm, kexec

On Sat, Mar 08, 2025 at 12:10:12AM +0000, Pratyush Yadav wrote:
> Hi Christian,
> 
> Thanks for the review!

No worries, I'm not trying to be polemic. It's just that this whole
proposed concept is pretty lightweight in terms of thinking about
possible implications.

> > This use-case is covered with systemd's fdstore and it's available to
> > unprivileged userspace. Stashing arbitrary file descriptors in the
> > kernel in this way isn't a good idea.
> 
> For one, it can't be arbitrary FDs, but only explicitly enabled ones.
> Beyond that, while not intended, there is no way to stop userspace from
> using it as a stash. Stashing FDs is a needed operation for this to
> work, and there is no way to guarantee in advance that userspace will
> actually use it for KHO, and not just stash it to grab back later.

As written it can't ever function as a generic file descriptor store.

It only allows fully privileged processes to stash file descriptors.
Which makes it useless for generic userspace. A generic fdstore should
have a model that makes it usable unprivileged it probably should also
be multi-instance and work easily with namespaces. This doesn't and
hitching it on devtmpfs and character devices is guaranteed to not work
well with such use-cases.

It also has big time security issues and implications. Any file you
stash in there will have the credentials of the opener attached to it.
So if someone stashes anything in there you need permission mechanisms
that ensures that Joe Random can't via FDBOX_GET_FD pull out a file for
e.g., someone else's cgroup and happily migrate processses under the
openers credentials or mess around some random executing binary.

So you need a model of who is allowed to pull out what file descriptors
from a file descriptor stash. What are the semantics for that? What's
the security model for that? What are possible corner cases?

For systemd's userspace fstore that's covered by policy it can implement
quite easily what fds it accepts. For the kernel it's a lot more
complicated.

If someone puts in file descriptors for a bunch of files in there opened
in different mount namespaces then this will pin said mount namespaces.
If the last process in the mount namespace exists the mount namespace
would be cleaned up but not anymore. The mount namespace would stay
pinned. Not wrong, but needs to be spelled out what the implications of
this are.

What if someone puts a file descriptor from devtmpfs or for /dev/fdbox
into an fdbox? Even if that's blocked, what happens if someone creates a
detached bind-mount of a /dev/fdbox mount and mounts it into a different
mount namespace and then puts a file descriptor for that mount namespace
into the fdbox? Tons of other scenarios come to mind. Ignoring when
networking is brought into the mix as well.

It's not done by just letting the kernel stash some files and getting
them out later somehow and then see whether it's somehow useful in the
future for other stuff. A generic globally usable fdstore is not
happening without a clear and detailed analysis what the semantics are
going to be.

So either that work is done right from the start or that stashing files
goes out the window and instead that KHO part is implemented in a way
where during a KHO dump relevant userspace is notified that they must
now serialize their state into the serialization stash. And no files are
actually kept in there at all.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 1/5] misc: introduce FDBox
  2025-03-08 11:09       ` Christian Brauner
@ 2025-03-17 16:46         ` Jason Gunthorpe
  0 siblings, 0 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2025-03-17 16:46 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Pratyush Yadav, Linus Torvalds, linux-kernel, Jonathan Corbet,
	Eric Biederman, Arnd Bergmann, Greg Kroah-Hartman, Alexander Viro,
	Jan Kara, Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt,
	David Woodhouse, James Gowans, Mike Rapoport, Paolo Bonzini,
	Pasha Tatashin, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Matthew Wilcox, Wei Yang, Andrew Morton, linux-fsdevel, linux-doc,
	linux-mm, kexec

On Sat, Mar 08, 2025 at 12:09:53PM +0100, Christian Brauner wrote:
> On Fri, Mar 07, 2025 at 11:14:17AM -0400, Jason Gunthorpe wrote:
> > On Fri, Mar 07, 2025 at 10:31:39AM +0100, Christian Brauner wrote:
> > > On Fri, Mar 07, 2025 at 12:57:35AM +0000, Pratyush Yadav wrote:
> > > > The File Descriptor Box (FDBox) is a mechanism for userspace to name
> > > > file descriptors and give them over to the kernel to hold. They can
> > > > later be retrieved by passing in the same name.
> > > > 
> > > > The primary purpose of FDBox is to be used with Kexec Handover (KHO).
> > > > There are many kinds anonymous file descriptors in the kernel like
> > > > memfd, guest_memfd, iommufd, etc. that would be useful to be preserved
> > > > using KHO. To be able to do that, there needs to be a mechanism to label
> > > > FDs that allows userspace to set the label before doing KHO and to use
> > > > the label to map them back after KHO. FDBox achieves that purpose by
> > > > exposing a miscdevice which exposes ioctls to label and transfer FDs
> > > > between the kernel and userspace. FDBox is not intended to work with any
> > > > generic file descriptor. Support for each kind of FDs must be explicitly
> > > > enabled.
> > > 
> > > This makes no sense as a generic concept. If you want to restore shmem
> > > and possibly anonymous inodes files via KHO then tailor the solution to
> > > shmem and anon inodes but don't make this generic infrastructure. This
> > > has zero chances to cover generic files.
> > 
> > We need it to cover a range of FD types in the kernel like iommufd and
> 
> anonymous inode
> 
> > vfio.
> 
> anonymous inode

Yes, I think Pratyush did not really capture that point, that it is
really only for very limited FD types. Realistically probably only
anonymous like things.

> > It is not "generic" in the sense every FD in the kernel magicaly works
> > with fdbox, but that any driver/subsystem providing a FD could be
> > enlightened to support it.
> > 
> > Very much do not want the infrastructure tied to just shmem and memfd.
> 
> Anything you can reasonably want will either be an internal shmem mount,
> devtmpfs, or anonymous inodes. Anything else isn't going to work.

Yes.
 
> I'm not yet sold that this needs to be a character device. Because
> that's fundamentally limiting in how useful this can be.

It is part of KHO, and I think KHO wants a character device for other
reasons anyhow.

The whole concept is tied to KHO intrinsically because this new
file_operations callback is going to be calling KHO related functions
to register the information contained in the FD with KHO.

Also, I kind of expect it to be semi-destructive to the FDs in
someway, especially for VFIO and iommufd. The FD will have to be
prepared to go into the KHO first.

> It might be way more useful if this ended up being a separate tiny
> filesystem where such preserved files are simply shown as named entries
> that you can open instead of ioctl()ing your way through character
> devices. But I need to think about that.

It could be possible, but I think this is more complex, and not really
too useful. How do you store a iommufd anonymous inode in a new
special filesystem? What permissions does it have after kexec? How
does open work? What if you open the same path multiple times? What
about the single-open rules of VFIO? How do you "open" co-linked FDs
like VFIO & iommufd?

A char device can give pretty reasonable answers to these questions
when we don't have to pretend to be a filesytem..

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 1/5] misc: introduce FDBox
  2025-03-09 12:03       ` Christian Brauner
@ 2025-03-17 16:59         ` Jason Gunthorpe
  2025-03-18 14:25           ` Christian Brauner
  0 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2025-03-17 16:59 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Pratyush Yadav, Linus Torvalds, linux-kernel, Jonathan Corbet,
	Eric Biederman, Arnd Bergmann, Greg Kroah-Hartman, Alexander Viro,
	Jan Kara, Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt,
	David Woodhouse, James Gowans, Mike Rapoport, Paolo Bonzini,
	Pasha Tatashin, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Matthew Wilcox, Wei Yang, Andrew Morton, linux-fsdevel, linux-doc,
	linux-mm, kexec

On Sun, Mar 09, 2025 at 01:03:31PM +0100, Christian Brauner wrote:

> So either that work is done right from the start or that stashing files
> goes out the window and instead that KHO part is implemented in a way
> where during a KHO dump relevant userspace is notified that they must
> now serialize their state into the serialization stash. And no files are
> actually kept in there at all.

Let's ignore memfd/shmem for a moment..

It is not userspace state that is being serialized, it is *kernel*
state inside device drivers like VFIO/iommufd/kvm/etc that is being
serialized to the KHO.

The file descriptor is simply the handle to the kernel state. It is
not a "file" in any normal filesystem sense, it is just an uAPI handle
for a char dev that is used with IOCTL.

When KHO is triggered triggered whatever is contained inside the FD is
serialized into the KHO.

So we need:
 1) A way to register FDs to be serialized. For instance, not every
    VFIO FD should be retained.
 2) A way for the kexecing kernel to make callbacks to the char dev
    owner (probably via struct file operations) to perform the
    serialization
 3) A way for the new kernel to ask the char dev owner to create a new
    struct file out of the serialized data. Probably allowed to happen
    only once, ie you can't clone these things. This is not the same
    as just opening an empty char device, it would also fill the char
    device with whatever data was serialized.
 4) A way to get the struct file into a process fd number so userspace
    can route it to the right place.

It is not really a stash, it is not keeping files, it is hardwired to
KHO to drive it's serialize/deserialize mechanism around char devs in
a very limited way.

If you have that then feeding an anonymous memfd/guestmemfd through
the same machinery is a fairly small and logical step.

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 1/5] misc: introduce FDBox
  2025-03-17 16:59         ` Jason Gunthorpe
@ 2025-03-18 14:25           ` Christian Brauner
  2025-03-18 14:57             ` Jason Gunthorpe
  0 siblings, 1 reply; 29+ messages in thread
From: Christian Brauner @ 2025-03-18 14:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Pratyush Yadav, Linus Torvalds, linux-kernel, Jonathan Corbet,
	Eric Biederman, Arnd Bergmann, Greg Kroah-Hartman, Alexander Viro,
	Jan Kara, Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt,
	David Woodhouse, James Gowans, Mike Rapoport, Paolo Bonzini,
	Pasha Tatashin, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Matthew Wilcox, Wei Yang, Andrew Morton, linux-fsdevel, linux-doc,
	linux-mm, kexec

On Mon, Mar 17, 2025 at 01:59:05PM -0300, Jason Gunthorpe wrote:
> On Sun, Mar 09, 2025 at 01:03:31PM +0100, Christian Brauner wrote:
> 
> > So either that work is done right from the start or that stashing files
> > goes out the window and instead that KHO part is implemented in a way
> > where during a KHO dump relevant userspace is notified that they must
> > now serialize their state into the serialization stash. And no files are
> > actually kept in there at all.
> 
> Let's ignore memfd/shmem for a moment..
> 
> It is not userspace state that is being serialized, it is *kernel*
> state inside device drivers like VFIO/iommufd/kvm/etc that is being
> serialized to the KHO.
> 
> The file descriptor is simply the handle to the kernel state. It is
> not a "file" in any normal filesystem sense, it is just an uAPI handle
> for a char dev that is used with IOCTL.
> 
> When KHO is triggered triggered whatever is contained inside the FD is
> serialized into the KHO.
> 
> So we need:
>  1) A way to register FDs to be serialized. For instance, not every
>     VFIO FD should be retained.
>  2) A way for the kexecing kernel to make callbacks to the char dev
>     owner (probably via struct file operations) to perform the
>     serialization
>  3) A way for the new kernel to ask the char dev owner to create a new
>     struct file out of the serialized data. Probably allowed to happen
>     only once, ie you can't clone these things. This is not the same
>     as just opening an empty char device, it would also fill the char
>     device with whatever data was serialized.
>  4) A way to get the struct file into a process fd number so userspace
>     can route it to the right place.
> 
> It is not really a stash, it is not keeping files, it is hardwired to

Right now as written it is keeping references to files in these fdboxes
and thus functioning both as a crippled high-privileged fdstore and a
serialization mechanism. Please get rid of the fdstore bits and
implement it in a way that it serializes files without stashing
references to live files that can at arbitrary points in time before the
fdbox is "sealed" be pulled out and installed into the caller's fdtable
again.

> KHO to drive it's serialize/deserialize mechanism around char devs in
> a very limited way.
> 
> If you have that then feeding an anonymous memfd/guestmemfd through
> the same machinery is a fairly small and logical step.
> 
> Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 1/5] misc: introduce FDBox
  2025-03-18 14:25           ` Christian Brauner
@ 2025-03-18 14:57             ` Jason Gunthorpe
  2025-03-18 23:02               ` Pratyush Yadav
  0 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2025-03-18 14:57 UTC (permalink / raw)
  To: Christian Brauner
  Cc: Pratyush Yadav, Linus Torvalds, linux-kernel, Jonathan Corbet,
	Eric Biederman, Arnd Bergmann, Greg Kroah-Hartman, Alexander Viro,
	Jan Kara, Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt,
	David Woodhouse, James Gowans, Mike Rapoport, Paolo Bonzini,
	Pasha Tatashin, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Matthew Wilcox, Wei Yang, Andrew Morton, linux-fsdevel, linux-doc,
	linux-mm, kexec

On Tue, Mar 18, 2025 at 03:25:25PM +0100, Christian Brauner wrote:

> > It is not really a stash, it is not keeping files, it is hardwired to
> 
> Right now as written it is keeping references to files in these fdboxes
> and thus functioning both as a crippled high-privileged fdstore and a
> serialization mechanism. 

I think Pratyush went a bit overboard on that, I can see it is useful
for testing, but really the kho control FD should be in either
serializing or deserializing mode and it should not really act as an
FD store.

However, edge case handling makes this a bit complicated. 

Once a FD is submitted to be serialized that FD has to be frozen and
can't be allowed to change anymore.

If the kexec process aborts then we need to unwind all of this stuff
and unfreeze all the FDs.

It sure would be nice if the freezing process could be managed
generically somehow.

One option for freezing would have the kernel enforce that userspace
has closed and idled the FD everywhere (eg check the struct file
refcount == 1). If userspace doesn't have access to the FD then it is
effectively frozen.

In this case the error path would need to bring the FD back out of the
fdbox.

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 1/5] misc: introduce FDBox
  2025-03-18 14:57             ` Jason Gunthorpe
@ 2025-03-18 23:02               ` Pratyush Yadav
  2025-03-18 23:27                 ` Jason Gunthorpe
  0 siblings, 1 reply; 29+ messages in thread
From: Pratyush Yadav @ 2025-03-18 23:02 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christian Brauner, Linus Torvalds, linux-kernel, Jonathan Corbet,
	Eric Biederman, Arnd Bergmann, Greg Kroah-Hartman, Alexander Viro,
	Jan Kara, Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt,
	David Woodhouse, James Gowans, Mike Rapoport, Paolo Bonzini,
	Pasha Tatashin, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Matthew Wilcox, Wei Yang, Andrew Morton, linux-fsdevel, linux-doc,
	linux-mm, kexec

On Tue, Mar 18 2025, Jason Gunthorpe wrote:

> On Tue, Mar 18, 2025 at 03:25:25PM +0100, Christian Brauner wrote:
>
>> > It is not really a stash, it is not keeping files, it is hardwired to
>> 
>> Right now as written it is keeping references to files in these fdboxes
>> and thus functioning both as a crippled high-privileged fdstore and a
>> serialization mechanism. 
>
> I think Pratyush went a bit overboard on that, I can see it is useful
> for testing, but really the kho control FD should be in either
> serializing or deserializing mode and it should not really act as an
> FD store.
>
> However, edge case handling makes this a bit complicated. 
>
> Once a FD is submitted to be serialized that FD has to be frozen and
> can't be allowed to change anymore.
>
> If the kexec process aborts then we need to unwind all of this stuff
> and unfreeze all the FDs.

I do think I might have went a bit overboard, but this was one of the
reasons for doing so. Having the struct file around, and having the
ability to map it back in allowed for kexec failure to be recoverable
easily and quickly.

I suppose we can serialize all FDs when the box is sealed and get rid of
the struct file. If kexec fails, userspace can unseal the box, and FDs
will be deserialized into a new struct file. This way, the behaviour
from userspace perspective also stays the same regardless of whether
kexec went through or not. This also helps tie FDBox closer to KHO.

The downside is that the recovery time will be slower since the state
has to be deserialized, but I suppose kexec failure should not happen
too often so that is something we can live with.

What do you think about doing it this way?

>
> It sure would be nice if the freezing process could be managed
> generically somehow.
>
> One option for freezing would have the kernel enforce that userspace
> has closed and idled the FD everywhere (eg check the struct file
> refcount == 1). If userspace doesn't have access to the FD then it is
> effectively frozen.

Yes, that is what I want to do in the next revision. FDBox itself will
not close the file descriptors when you put a FD in the box. It will
just grab a reference and let the userspace close the FD. Then when the
box is sealed, the operation can be refused if refcount != 1.

>
> In this case the error path would need to bring the FD back out of the
> fdbox.
>
> Jason
>

-- 
Regards,
Pratyush Yadav

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 1/5] misc: introduce FDBox
  2025-03-18 23:02               ` Pratyush Yadav
@ 2025-03-18 23:27                 ` Jason Gunthorpe
  2025-03-19 13:35                   ` Pratyush Yadav
  0 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2025-03-18 23:27 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Christian Brauner, Linus Torvalds, linux-kernel, Jonathan Corbet,
	Eric Biederman, Arnd Bergmann, Greg Kroah-Hartman, Alexander Viro,
	Jan Kara, Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt,
	David Woodhouse, James Gowans, Mike Rapoport, Paolo Bonzini,
	Pasha Tatashin, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Matthew Wilcox, Wei Yang, Andrew Morton, linux-fsdevel, linux-doc,
	linux-mm, kexec

On Tue, Mar 18, 2025 at 11:02:31PM +0000, Pratyush Yadav wrote:

> I suppose we can serialize all FDs when the box is sealed and get rid of
> the struct file. If kexec fails, userspace can unseal the box, and FDs
> will be deserialized into a new struct file. This way, the behaviour
> from userspace perspective also stays the same regardless of whether
> kexec went through or not. This also helps tie FDBox closer to KHO.

I don't think we can do a proper de-serialization without going
through kexec. The new stuff Mike is posting for preserving memory
will not work like that.

I think error recovery wil have to work by just restoring access to
the FD and it's driver state that was never actually destroyed.

> > It sure would be nice if the freezing process could be managed
> > generically somehow.
> >
> > One option for freezing would have the kernel enforce that userspace
> > has closed and idled the FD everywhere (eg check the struct file
> > refcount == 1). If userspace doesn't have access to the FD then it is
> > effectively frozen.
> 
> Yes, that is what I want to do in the next revision. FDBox itself will
> not close the file descriptors when you put a FD in the box. It will
> just grab a reference and let the userspace close the FD. Then when the
> box is sealed, the operation can be refused if refcount != 1.

I'm not sure about this sealed idea..

One of the design points here was to have different phases for the KHO
process and we want to shift alot of work to the earlier phases. Some
of that work should be putting things into the fdbox, freezing them,
and writing out the serialzation as that may be quite time consuming.

The same is true for the deserialize step where we don't want to bulk
deserialize but do it in an ordered way to minimize the critical
downtime.

So I'm not sure if a 'seal' operation that goes and bulk serializes
everything makes sense. I still haven't seen a state flow chart and a
proposal where all the different required steps would have to land to
get any certainty here.

At least in my head I imagined you'd open the KHO FD, put it in
serializing mode and then go through in the right order pushing all
the work and building the serializion data structure as you go.

At the very end you'd finalize the KHO serialization, which just
writes out a little bit more to the FDT and gives you back the FDT
blob for the kexec. It should be a very fast operation.

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 1/5] misc: introduce FDBox
  2025-03-18 23:27                 ` Jason Gunthorpe
@ 2025-03-19 13:35                   ` Pratyush Yadav
  2025-03-20 12:14                     ` Jason Gunthorpe
  0 siblings, 1 reply; 29+ messages in thread
From: Pratyush Yadav @ 2025-03-19 13:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christian Brauner, Linus Torvalds, linux-kernel, Jonathan Corbet,
	Eric Biederman, Arnd Bergmann, Greg Kroah-Hartman, Alexander Viro,
	Jan Kara, Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt,
	David Woodhouse, James Gowans, Mike Rapoport, Paolo Bonzini,
	Pasha Tatashin, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Matthew Wilcox, Wei Yang, Andrew Morton, linux-fsdevel, linux-doc,
	linux-mm, kexec

On Tue, Mar 18 2025, Jason Gunthorpe wrote:

> On Tue, Mar 18, 2025 at 11:02:31PM +0000, Pratyush Yadav wrote:
>
>> I suppose we can serialize all FDs when the box is sealed and get rid of
>> the struct file. If kexec fails, userspace can unseal the box, and FDs
>> will be deserialized into a new struct file. This way, the behaviour
>> from userspace perspective also stays the same regardless of whether
>> kexec went through or not. This also helps tie FDBox closer to KHO.
>
> I don't think we can do a proper de-serialization without going
> through kexec. The new stuff Mike is posting for preserving memory
> will not work like that.

Why not? If the next kernel can restore the file from the serialized
content, so can the current kernel. What stops this from working with
the new memory preservation scheme (which I assume is the idea you
proposed in [0])? In that, kho_preserve_folio() marks a page to be
preserved across KHO. We can have a kho_restore_folio() function that
removes the reservation from the xarray and returns the folio to the
caller. The KHO machinery takes care of abstracting the detail of
whether kexec actually happened. With that in place, I don't see why we
can't deserialize without going through kexec.

>
> I think error recovery wil have to work by just restoring access to
> the FD and it's driver state that was never actually destroyed.
>
>> > It sure would be nice if the freezing process could be managed
>> > generically somehow.
>> >
>> > One option for freezing would have the kernel enforce that userspace
>> > has closed and idled the FD everywhere (eg check the struct file
>> > refcount == 1). If userspace doesn't have access to the FD then it is
>> > effectively frozen.
>> 
>> Yes, that is what I want to do in the next revision. FDBox itself will
>> not close the file descriptors when you put a FD in the box. It will
>> just grab a reference and let the userspace close the FD. Then when the
>> box is sealed, the operation can be refused if refcount != 1.
>
> I'm not sure about this sealed idea..
>
> One of the design points here was to have different phases for the KHO
> process and we want to shift alot of work to the earlier phases. Some
> of that work should be putting things into the fdbox, freezing them,
> and writing out the serialzation as that may be quite time consuming.
>
> The same is true for the deserialize step where we don't want to bulk
> deserialize but do it in an ordered way to minimize the critical
> downtime.
>
> So I'm not sure if a 'seal' operation that goes and bulk serializes
> everything makes sense. I still haven't seen a state flow chart and a
> proposal where all the different required steps would have to land to
> get any certainty here.

The seal operation does bulk serialize/deserialize for _one_ box. You
can have multiple boxes and distribute your FDs in the boxes based on
the serialize or deserialize order you want. Userspace decides when to
seal or unseal a particular box, which gives it full control over the
order in which things happen.

>
> At least in my head I imagined you'd open the KHO FD, put it in
> serializing mode and then go through in the right order pushing all
> the work and building the serializion data structure as you go.

If we serialize the box at seal time, this is exactly how things will be
done. Before KHO activate happens, userspace can start putting in FDs
and start serializing things. Then when activation happens, the
box-level metadata gets quickly written out to the main FDT and that's
it. The bulk of the per-fd work should already be done.

We can even have something like FDBOX_PREPARE_FD or FDBOX_PREPARE_BOX
that pre-serializes as much as it can before anything is actually
frozen, so the actual freeze is faster. This is similar to pre-copy
during live migration for example.

All of this is made easier if each component has its own FDT (or any
other data structure) and doesn't have to share the same FDT. This is
the direction we are going in anyway with the next KHO versions.

>
> At the very end you'd finalize the KHO serialization, which just
> writes out a little bit more to the FDT and gives you back the FDT
> blob for the kexec. It should be a very fast operation.
>
> Jason
>

[0] https://lore.kernel.org/lkml/20250212152336.GA3848889@nvidia.com/

-- 
Regards,
Pratyush Yadav

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 1/5] misc: introduce FDBox
  2025-03-19 13:35                   ` Pratyush Yadav
@ 2025-03-20 12:14                     ` Jason Gunthorpe
  2025-03-26 22:40                       ` Pratyush Yadav
  0 siblings, 1 reply; 29+ messages in thread
From: Jason Gunthorpe @ 2025-03-20 12:14 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Christian Brauner, Linus Torvalds, linux-kernel, Jonathan Corbet,
	Eric Biederman, Arnd Bergmann, Greg Kroah-Hartman, Alexander Viro,
	Jan Kara, Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt,
	David Woodhouse, James Gowans, Mike Rapoport, Paolo Bonzini,
	Pasha Tatashin, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Matthew Wilcox, Wei Yang, Andrew Morton, linux-fsdevel, linux-doc,
	linux-mm, kexec

On Wed, Mar 19, 2025 at 01:35:31PM +0000, Pratyush Yadav wrote:
> On Tue, Mar 18 2025, Jason Gunthorpe wrote:
> 
> > On Tue, Mar 18, 2025 at 11:02:31PM +0000, Pratyush Yadav wrote:
> >
> >> I suppose we can serialize all FDs when the box is sealed and get rid of
> >> the struct file. If kexec fails, userspace can unseal the box, and FDs
> >> will be deserialized into a new struct file. This way, the behaviour
> >> from userspace perspective also stays the same regardless of whether
> >> kexec went through or not. This also helps tie FDBox closer to KHO.
> >
> > I don't think we can do a proper de-serialization without going
> > through kexec. The new stuff Mike is posting for preserving memory
> > will not work like that.
> 
> Why not? If the next kernel can restore the file from the serialized
> content, so can the current kernel. What stops this from working with
> the new memory preservation scheme (which I assume is the idea you
> proposed in [0])? 

It is because the current kernel does not destroy the struct page
before the kexec and the new kernel assumes a zero'd fresh struct page
at restore.

So it would be very easy to corrupt the struct page information if you
attempt to deserialize without going through the kexec step.

There would be a big risk of getting things like refcounts out of
sync.

Then you have the issue that I don't actually imagine shutting down
something like iommufd, I was intending to leave it frozen in place
with all its allocations and so on. If you try to de-serialize you
can't de-serialize into the thing that is frozen, you'd create a new
one from empty. Now you have two things pointing at the same stuff,
what a mess.

> The seal operation does bulk serialize/deserialize for _one_ box. You
> can have multiple boxes and distribute your FDs in the boxes based on
> the serialize or deserialize order you want. Userspace decides when to
> seal or unseal a particular box, which gives it full control over the
> order in which things happen.

Why have more than one box? What is the point? I've been thinking we
should just have a KHO control char dev FD for serializing and you can
do all the operations people have been talking about in sysfs, as well
as record FDs for serializing.

Why do we need more than one fdbox container fd?

> All of this is made easier if each component has its own FDT (or any
> other data structure) and doesn't have to share the same FDT. This is
> the direction we are going in anyway with the next KHO versions.

Yes, I agree with that for sure.

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 1/5] misc: introduce FDBox
  2025-03-20 12:14                     ` Jason Gunthorpe
@ 2025-03-26 22:40                       ` Pratyush Yadav
  2025-03-31 15:38                         ` Jason Gunthorpe
  0 siblings, 1 reply; 29+ messages in thread
From: Pratyush Yadav @ 2025-03-26 22:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Christian Brauner, Linus Torvalds, linux-kernel, Jonathan Corbet,
	Eric Biederman, Arnd Bergmann, Greg Kroah-Hartman, Alexander Viro,
	Jan Kara, Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt,
	David Woodhouse, James Gowans, Mike Rapoport, Paolo Bonzini,
	Pasha Tatashin, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Matthew Wilcox, Wei Yang, Andrew Morton, linux-fsdevel, linux-doc,
	linux-mm, kexec

On Thu, Mar 20 2025, Jason Gunthorpe wrote:

> On Wed, Mar 19, 2025 at 01:35:31PM +0000, Pratyush Yadav wrote:
>> On Tue, Mar 18 2025, Jason Gunthorpe wrote:
>> 
>> > On Tue, Mar 18, 2025 at 11:02:31PM +0000, Pratyush Yadav wrote:
>> >
>> >> I suppose we can serialize all FDs when the box is sealed and get rid of
>> >> the struct file. If kexec fails, userspace can unseal the box, and FDs
>> >> will be deserialized into a new struct file. This way, the behaviour
>> >> from userspace perspective also stays the same regardless of whether
>> >> kexec went through or not. This also helps tie FDBox closer to KHO.
>> >
>> > I don't think we can do a proper de-serialization without going
>> > through kexec. The new stuff Mike is posting for preserving memory
>> > will not work like that.
>> 
>> Why not? If the next kernel can restore the file from the serialized
>> content, so can the current kernel. What stops this from working with
>> the new memory preservation scheme (which I assume is the idea you
>> proposed in [0])? 
>
> It is because the current kernel does not destroy the struct page
> before the kexec and the new kernel assumes a zero'd fresh struct page
> at restore.
>
> So it would be very easy to corrupt the struct page information if you
> attempt to deserialize without going through the kexec step.
>
> There would be a big risk of getting things like refcounts out of
> sync.

Ideally, kho_preserve_folio() should be similar to freeing the folio,
except that it doesn't go to buddy for re-allocation. In that case,
re-using those pages should not be a problem as long as the driver made
sure the page was properly "freed", and there are no stale references to
it. They should be doing that anyway since they should make sure the
file doesn't change after it has been serialized.

Doing that might be easier said than done though. On a quick look, most
of the clearing of struct page seems to be happening in
free_pages_prepare(). This is usually followed by free_one_page(), which
gives the page back to buddy. Though I am not sure how much sense it
would make to use free_pages_prepare() outside of page free path. I need
to look deeper...

>
> Then you have the issue that I don't actually imagine shutting down
> something like iommufd, I was intending to leave it frozen in place
> with all its allocations and so on. If you try to de-serialize you
> can't de-serialize into the thing that is frozen, you'd create a new
> one from empty. Now you have two things pointing at the same stuff,
> what a mess.

What do you mean by "frozen in place"? Isn't that the same as being
serialized? Considering that we want to make sure a file is not opened
by any process before we serialize it, what do we get by keeping the
struct file around (assuming we can safely deserialize it without going
through kexec)?

>
>> The seal operation does bulk serialize/deserialize for _one_ box. You
>> can have multiple boxes and distribute your FDs in the boxes based on
>> the serialize or deserialize order you want. Userspace decides when to
>> seal or unseal a particular box, which gives it full control over the
>> order in which things happen.
>
> Why have more than one box? What is the point? I've been thinking we
> should just have a KHO control char dev FD for serializing and you can
> do all the operations people have been talking about in sysfs, as well
> as record FDs for serializing.

Main idea is for logical grouping and dependency management. If some FDs
have a dependency between them, grouping them in different boxes makes
it easy to let userspace choose the order of operations, but still have
a way to make sure all dependencies are met when the FDs are serialized.
Similarly, on the deserialize side, this ensures that all dependent FDs
are deserialized together.

[...]

-- 
Regards,
Pratyush Yadav

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [RFC PATCH 1/5] misc: introduce FDBox
  2025-03-26 22:40                       ` Pratyush Yadav
@ 2025-03-31 15:38                         ` Jason Gunthorpe
  0 siblings, 0 replies; 29+ messages in thread
From: Jason Gunthorpe @ 2025-03-31 15:38 UTC (permalink / raw)
  To: Pratyush Yadav
  Cc: Christian Brauner, Linus Torvalds, linux-kernel, Jonathan Corbet,
	Eric Biederman, Arnd Bergmann, Greg Kroah-Hartman, Alexander Viro,
	Jan Kara, Hugh Dickins, Alexander Graf, Benjamin Herrenschmidt,
	David Woodhouse, James Gowans, Mike Rapoport, Paolo Bonzini,
	Pasha Tatashin, Anthony Yznaga, Dave Hansen, David Hildenbrand,
	Matthew Wilcox, Wei Yang, Andrew Morton, linux-fsdevel, linux-doc,
	linux-mm, kexec

On Wed, Mar 26, 2025 at 10:40:29PM +0000, Pratyush Yadav wrote:
> Ideally, kho_preserve_folio() should be similar to freeing the folio,
> except that it doesn't go to buddy for re-allocation. In that case,
> re-using those pages should not be a problem as long as the driver made
> sure the page was properly "freed", and there are no stale references to
> it. They should be doing that anyway since they should make sure the
> file doesn't change after it has been serialized.

I don't know if this is a good idea, it seems to make error recovery
much more complex.

> > Then you have the issue that I don't actually imagine shutting down
> > something like iommufd, I was intending to leave it frozen in place
> > with all its allocations and so on. If you try to de-serialize you
> > can't de-serialize into the thing that is frozen, you'd create a new
> > one from empty. Now you have two things pointing at the same stuff,
> > what a mess.
> 
> What do you mean by "frozen in place"? Isn't that the same as being
> serialized? 

I mean all the memory and internal state is still there, it is just
not changing. It is not the same as being serialized, as the
de-serialized versions of everything would still exist in parallel.

> Considering that we want to make sure a file is not opened by any
> process before we serialize it, what do we get by keeping the struct
> file around (assuming we can safely deserialize it without going
> through kexec)?

We do alot less work.

Having serialize reliably but the entire system into a fully
post-live-update state, including dependent things like the
iommufd/vfio attachment and iommu driver, is very hard. This stuff is
quite complex.

I imagine instead we have three data states
 - Fully operating
 - Frozen and all preserved memory logged in KHO
 - post-live-update where there are hints scattered around the drivers
   about what is in the KHO

From an error prespective going from frozen back to fully operating
should just be throwing away the KHO record and allowing use of the FD
again. That is super simply and makes error recovery during
micro-steps of the KHO simple and safe.

If you imagine that KHO is destructive then every failure point needs
to unwind the partial destruction which is a total nightmare to code :\

> Main idea is for logical grouping and dependency management. If some FDs
> have a dependency between them, grouping them in different boxes makes
> it easy to let userspace choose the order of operations, but still have
> a way to make sure all dependencies are met when the FDs are serialized.
> Similarly, on the deserialize side, this ensures that all dependent FDs
> are deserialized together.

That seems over complicated to me. Userspace should write the FDs in
the required order and that should be a topological sort of the
required dependencies. kernel should just validate this was done.

Jason

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2025-03-31 15:38 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-07  0:57 [RFC PATCH 0/5] Introduce FDBox, and preserve memfd with shmem over KHO Pratyush Yadav
2025-03-07  0:57 ` [RFC PATCH 1/5] misc: introduce FDBox Pratyush Yadav
2025-03-07  6:03   ` Greg Kroah-Hartman
2025-03-07  9:31   ` Christian Brauner
2025-03-07 13:19     ` Christian Brauner
2025-03-07 15:14     ` Jason Gunthorpe
2025-03-08 11:09       ` Christian Brauner
2025-03-17 16:46         ` Jason Gunthorpe
2025-03-08  0:10     ` Pratyush Yadav
2025-03-09 12:03       ` Christian Brauner
2025-03-17 16:59         ` Jason Gunthorpe
2025-03-18 14:25           ` Christian Brauner
2025-03-18 14:57             ` Jason Gunthorpe
2025-03-18 23:02               ` Pratyush Yadav
2025-03-18 23:27                 ` Jason Gunthorpe
2025-03-19 13:35                   ` Pratyush Yadav
2025-03-20 12:14                     ` Jason Gunthorpe
2025-03-26 22:40                       ` Pratyush Yadav
2025-03-31 15:38                         ` Jason Gunthorpe
2025-03-07  0:57 ` [RFC PATCH 2/5] misc: add documentation for FDBox Pratyush Yadav
2025-03-07  2:19   ` Randy Dunlap
2025-03-07 15:03     ` Pratyush Yadav
2025-03-07 14:22   ` Jonathan Corbet
2025-03-07 14:51     ` Pratyush Yadav
2025-03-07 15:25       ` Jonathan Corbet
2025-03-07 23:28         ` Pratyush Yadav
2025-03-07  0:57 ` [RFC PATCH 3/5] mm: shmem: allow callers to specify operations to shmem_undo_range Pratyush Yadav
2025-03-07  0:57 ` [RFC PATCH 4/5] mm: shmem: allow preserving file over FDBOX + KHO Pratyush Yadav
2025-03-07  0:57 ` [RFC PATCH 5/5] mm/memfd: allow preserving FD " Pratyush Yadav

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).