* [PATCH v2 1/8] fwctl: Add basic structure for a class subsystem with a cdev
2024-06-24 22:47 [PATCH v2 0/8] Introduce fwctl subystem Jason Gunthorpe
@ 2024-06-24 22:47 ` Jason Gunthorpe
2024-06-25 4:47 ` Bagas Sanjaya
2024-07-26 14:30 ` Jonathan Cameron
2024-06-24 22:47 ` [PATCH v2 2/8] fwctl: Basic ioctl dispatch for the character device Jason Gunthorpe
` (7 subsequent siblings)
8 siblings, 2 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2024-06-24 22:47 UTC (permalink / raw)
To: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan
Cc: Andy Gospodarek, Aron Silverton, Dan Williams, David Ahern,
Christoph Hellwig, Jiri Pirko, Leonid Bloch, Leon Romanovsky,
linux-cxl, patches
Create the class, character device and functions for a fwctl driver to
un/register to the subsystem.
A typical fwctl driver has a sysfs presence like:
$ ls -l /dev/fwctl/fwctl0
crw------- 1 root root 250, 0 Apr 25 19:16 /dev/fwctl/fwctl0
$ ls /sys/class/fwctl/fwctl0
dev device power subsystem uevent
$ ls /sys/class/fwctl/fwctl0/device/infiniband/
ibp0s10f0
$ ls /sys/class/infiniband/ibp0s10f0/device/fwctl/
fwctl0/
$ ls /sys/devices/pci0000:00/0000:00:0a.0/fwctl/fwctl0
dev device power subsystem uevent
Which allows userspace to link all the multi-subsystem driver components
together and learn the subsystem specific names for the device's
components.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
MAINTAINERS | 8 ++
drivers/Kconfig | 2 +
drivers/Makefile | 1 +
drivers/fwctl/Kconfig | 9 +++
drivers/fwctl/Makefile | 4 +
drivers/fwctl/main.c | 177 +++++++++++++++++++++++++++++++++++++++++
include/linux/fwctl.h | 68 ++++++++++++++++
7 files changed, 269 insertions(+)
create mode 100644 drivers/fwctl/Kconfig
create mode 100644 drivers/fwctl/Makefile
create mode 100644 drivers/fwctl/main.c
create mode 100644 include/linux/fwctl.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 2ca8f35dfe0399..aa7a760d12f8ef 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9076,6 +9076,14 @@ F: kernel/futex/*
F: tools/perf/bench/futex*
F: tools/testing/selftests/futex/
+FWCTL SUBSYSTEM
+M: Jason Gunthorpe <jgg@nvidia.com>
+M: Saeed Mahameed <saeedm@nvidia.com>
+S: Maintained
+F: Documentation/userspace-api/fwctl.rst
+F: drivers/fwctl/
+F: include/linux/fwctl.h
+
GALAXYCORE GC0308 CAMERA SENSOR DRIVER
M: Sebastian Reichel <sre@kernel.org>
L: linux-media@vger.kernel.org
diff --git a/drivers/Kconfig b/drivers/Kconfig
index 7bdad836fc6207..7c556c5ac4fddc 100644
--- a/drivers/Kconfig
+++ b/drivers/Kconfig
@@ -21,6 +21,8 @@ source "drivers/connector/Kconfig"
source "drivers/firmware/Kconfig"
+source "drivers/fwctl/Kconfig"
+
source "drivers/gnss/Kconfig"
source "drivers/mtd/Kconfig"
diff --git a/drivers/Makefile b/drivers/Makefile
index fe9ceb0d2288ad..f6a241b747b29c 100644
--- a/drivers/Makefile
+++ b/drivers/Makefile
@@ -133,6 +133,7 @@ obj-$(CONFIG_MEMSTICK) += memstick/
obj-y += leds/
obj-$(CONFIG_INFINIBAND) += infiniband/
obj-y += firmware/
+obj-$(CONFIG_FWCTL) += fwctl/
obj-$(CONFIG_CRYPTO) += crypto/
obj-$(CONFIG_SUPERH) += sh/
obj-y += clocksource/
diff --git a/drivers/fwctl/Kconfig b/drivers/fwctl/Kconfig
new file mode 100644
index 00000000000000..37147a695add9a
--- /dev/null
+++ b/drivers/fwctl/Kconfig
@@ -0,0 +1,9 @@
+# SPDX-License-Identifier: GPL-2.0-only
+menuconfig FWCTL
+ tristate "fwctl device firmware access framework"
+ help
+ fwctl provides a userspace API for restricted access to communicate
+ with on-device firmware. The communication channel is intended to
+ support a wide range of lockdown compatible device behaviors including
+ manipulating device FLASH, debugging, and other activities that don't
+ fit neatly into an existing subsystem.
diff --git a/drivers/fwctl/Makefile b/drivers/fwctl/Makefile
new file mode 100644
index 00000000000000..1cad210f6ba580
--- /dev/null
+++ b/drivers/fwctl/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_FWCTL) += fwctl.o
+
+fwctl-y += main.o
diff --git a/drivers/fwctl/main.c b/drivers/fwctl/main.c
new file mode 100644
index 00000000000000..6e9bf15c743b5c
--- /dev/null
+++ b/drivers/fwctl/main.c
@@ -0,0 +1,177 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES
+ */
+#define pr_fmt(fmt) "fwctl: " fmt
+#include <linux/fwctl.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/container_of.h>
+#include <linux/fs.h>
+
+enum {
+ FWCTL_MAX_DEVICES = 256,
+};
+static dev_t fwctl_dev;
+static DEFINE_IDA(fwctl_ida);
+
+static int fwctl_fops_open(struct inode *inode, struct file *filp)
+{
+ struct fwctl_device *fwctl =
+ container_of(inode->i_cdev, struct fwctl_device, cdev);
+
+ get_device(&fwctl->dev);
+ filp->private_data = fwctl;
+ return 0;
+}
+
+static int fwctl_fops_release(struct inode *inode, struct file *filp)
+{
+ struct fwctl_device *fwctl = filp->private_data;
+
+ fwctl_put(fwctl);
+ return 0;
+}
+
+static const struct file_operations fwctl_fops = {
+ .owner = THIS_MODULE,
+ .open = fwctl_fops_open,
+ .release = fwctl_fops_release,
+};
+
+static void fwctl_device_release(struct device *device)
+{
+ struct fwctl_device *fwctl =
+ container_of(device, struct fwctl_device, dev);
+
+ ida_free(&fwctl_ida, fwctl->dev.devt - fwctl_dev);
+ kfree(fwctl);
+}
+
+static char *fwctl_devnode(const struct device *dev, umode_t *mode)
+{
+ return kasprintf(GFP_KERNEL, "fwctl/%s", dev_name(dev));
+}
+
+static struct class fwctl_class = {
+ .name = "fwctl",
+ .dev_release = fwctl_device_release,
+ .devnode = fwctl_devnode,
+};
+
+static struct fwctl_device *
+_alloc_device(struct device *parent, const struct fwctl_ops *ops, size_t size)
+{
+ struct fwctl_device *fwctl __free(kfree) = kzalloc(size, GFP_KERNEL);
+ int devnum;
+
+ if (!fwctl)
+ return NULL;
+ fwctl->dev.class = &fwctl_class;
+ fwctl->dev.parent = parent;
+
+ devnum = ida_alloc_max(&fwctl_ida, FWCTL_MAX_DEVICES - 1, GFP_KERNEL);
+ if (devnum < 0)
+ return NULL;
+ fwctl->dev.devt = fwctl_dev + devnum;
+
+ device_initialize(&fwctl->dev);
+ return_ptr(fwctl);
+}
+
+/* Drivers use the fwctl_alloc_device() wrapper */
+struct fwctl_device *_fwctl_alloc_device(struct device *parent,
+ const struct fwctl_ops *ops,
+ size_t size)
+{
+ struct fwctl_device *fwctl __free(fwctl) =
+ _alloc_device(parent, ops, size);
+
+ if (!fwctl)
+ return NULL;
+
+ cdev_init(&fwctl->cdev, &fwctl_fops);
+ fwctl->cdev.owner = THIS_MODULE;
+
+ if (dev_set_name(&fwctl->dev, "fwctl%d", fwctl->dev.devt - fwctl_dev))
+ return NULL;
+
+ fwctl->ops = ops;
+ return_ptr(fwctl);
+}
+EXPORT_SYMBOL_NS_GPL(_fwctl_alloc_device, FWCTL);
+
+/**
+ * fwctl_register - Register a new device to the subsystem
+ * @fwctl: Previously allocated fwctl_device
+ *
+ * On return the device is visible through sysfs and /dev, driver ops may be
+ * called.
+ */
+int fwctl_register(struct fwctl_device *fwctl)
+{
+ int ret;
+
+ ret = cdev_device_add(&fwctl->cdev, &fwctl->dev);
+ if (ret)
+ return ret;
+ return 0;
+}
+EXPORT_SYMBOL_NS_GPL(fwctl_register, FWCTL);
+
+/**
+ * fwctl_unregister - Unregister a device from the subsystem
+ * @fwctl: Previously allocated and registered fwctl_device
+ *
+ * Undoes fwctl_register(). On return no driver ops will be called. The
+ * caller must still call fwctl_put() to free the fwctl.
+ *
+ * Unregister will return even if userspace still has file descriptors open.
+ * This will call ops->close_uctx() on any open FDs and after return no driver
+ * op will be called. The FDs remain open but all fops will return -ENODEV.
+ *
+ * The design of fwctl allows this sort of disassociation of the driver from the
+ * subsystem primarily by keeping memory allocations owned by the core subsytem.
+ * The fwctl_device and fwctl_uctx can both be freed without requiring a driver
+ * callback. This allows the module to remain unlocked while FDs are open.
+ */
+void fwctl_unregister(struct fwctl_device *fwctl)
+{
+ cdev_device_del(&fwctl->cdev, &fwctl->dev);
+
+ /*
+ * The driver module may unload after this returns, the op pointer will
+ * not be valid.
+ */
+ fwctl->ops = NULL;
+}
+EXPORT_SYMBOL_NS_GPL(fwctl_unregister, FWCTL);
+
+static int __init fwctl_init(void)
+{
+ int ret;
+
+ ret = alloc_chrdev_region(&fwctl_dev, 0, FWCTL_MAX_DEVICES, "fwctl");
+ if (ret)
+ return ret;
+
+ ret = class_register(&fwctl_class);
+ if (ret)
+ goto err_chrdev;
+ return 0;
+
+err_chrdev:
+ unregister_chrdev_region(fwctl_dev, FWCTL_MAX_DEVICES);
+ return ret;
+}
+
+static void __exit fwctl_exit(void)
+{
+ class_unregister(&fwctl_class);
+ unregister_chrdev_region(fwctl_dev, FWCTL_MAX_DEVICES);
+}
+
+module_init(fwctl_init);
+module_exit(fwctl_exit);
+MODULE_DESCRIPTION("fwctl device firmware access framework");
+MODULE_LICENSE("GPL");
diff --git a/include/linux/fwctl.h b/include/linux/fwctl.h
new file mode 100644
index 00000000000000..ef4eaa87c945e4
--- /dev/null
+++ b/include/linux/fwctl.h
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES
+ */
+#ifndef __LINUX_FWCTL_H
+#define __LINUX_FWCTL_H
+#include <linux/device.h>
+#include <linux/cdev.h>
+#include <linux/cleanup.h>
+
+struct fwctl_device;
+struct fwctl_uctx;
+
+struct fwctl_ops {
+};
+
+/**
+ * struct fwctl_device - Per-driver registration struct
+ * @dev: The sysfs (class/fwctl/fwctlXX) device
+ *
+ * Each driver instance will have one of these structs with the driver
+ * private data following immeidately after. This struct is refcounted,
+ * it is freed by calling fwctl_put().
+ */
+struct fwctl_device {
+ struct device dev;
+ /* private: */
+ struct cdev cdev;
+ const struct fwctl_ops *ops;
+};
+
+struct fwctl_device *_fwctl_alloc_device(struct device *parent,
+ const struct fwctl_ops *ops,
+ size_t size);
+/**
+ * fwctl_alloc_device - Allocate a fwctl
+ * @parent: Physical device that provides the FW interface
+ * @ops: Driver ops to register
+ * @drv_struct: 'struct driver_fwctl' that holds the struct fwctl_device
+ * @member: Name of the struct fwctl_device in @drv_struct
+ *
+ * This allocates and initializes the fwctl_device embedded in the drv_struct.
+ * Upon success the pointer must be freed via fwctl_put(). Returns NULL on
+ * failure. Returns a 'drv_struct *' on success, NULL on error.
+ */
+#define fwctl_alloc_device(parent, ops, drv_struct, member) \
+ container_of(_fwctl_alloc_device( \
+ parent, ops, \
+ sizeof(drv_struct) + \
+ BUILD_BUG_ON_ZERO( \
+ offsetof(drv_struct, member))), \
+ drv_struct, member)
+
+static inline struct fwctl_device *fwctl_get(struct fwctl_device *fwctl)
+{
+ get_device(&fwctl->dev);
+ return fwctl;
+}
+static inline void fwctl_put(struct fwctl_device *fwctl)
+{
+ put_device(&fwctl->dev);
+}
+DEFINE_FREE(fwctl, struct fwctl_device *, if (_T) fwctl_put(_T));
+
+int fwctl_register(struct fwctl_device *fwctl);
+void fwctl_unregister(struct fwctl_device *fwctl);
+
+#endif
--
2.45.2
^ permalink raw reply related [flat|nested] 48+ messages in thread* Re: [PATCH v2 1/8] fwctl: Add basic structure for a class subsystem with a cdev
2024-06-24 22:47 ` [PATCH v2 1/8] fwctl: Add basic structure for a class subsystem with a cdev Jason Gunthorpe
@ 2024-06-25 4:47 ` Bagas Sanjaya
2024-07-22 16:04 ` Jason Gunthorpe
2024-07-26 14:30 ` Jonathan Cameron
1 sibling, 1 reply; 48+ messages in thread
From: Bagas Sanjaya @ 2024-06-25 4:47 UTC (permalink / raw)
To: Jason Gunthorpe, Jonathan Corbet, Itay Avraham, Jakub Kicinski,
Leon Romanovsky, linux-doc, linux-rdma, netdev, Paolo Abeni,
Saeed Mahameed, Tariq Toukan
Cc: Andy Gospodarek, Aron Silverton, Dan Williams, David Ahern,
Christoph Hellwig, Jiri Pirko, Leonid Bloch, Leon Romanovsky,
linux-cxl, patches
[-- Attachment #1: Type: text/plain, Size: 1764 bytes --]
On Mon, Jun 24, 2024 at 07:47:25PM -0300, Jason Gunthorpe wrote:
> +/**
> + * fwctl_alloc_device - Allocate a fwctl
> + * @parent: Physical device that provides the FW interface
> + * @ops: Driver ops to register
> + * @drv_struct: 'struct driver_fwctl' that holds the struct fwctl_device
> + * @member: Name of the struct fwctl_device in @drv_struct
> + *
> + * This allocates and initializes the fwctl_device embedded in the drv_struct.
> + * Upon success the pointer must be freed via fwctl_put(). Returns NULL on
> + * failure. Returns a 'drv_struct *' on success, NULL on error.
> + */
Sphinx reports htmldocs warning:
Documentation/userspace-api/fwctl:195: ./include/linux/fwctl.h:72: WARNING: Inline emphasis start-string without end-string.
I have to escape the pointer (while also cleaning up redundant wordings on
error case):
---- >8 ----
diff --git a/include/linux/fwctl.h b/include/linux/fwctl.h
index 294cfbf63306a2..ddadbe15189b45 100644
--- a/include/linux/fwctl.h
+++ b/include/linux/fwctl.h
@@ -70,8 +70,8 @@ struct fwctl_device *_fwctl_alloc_device(struct device *parent,
* @member: Name of the struct fwctl_device in @drv_struct
*
* This allocates and initializes the fwctl_device embedded in the drv_struct.
- * Upon success the pointer must be freed via fwctl_put(). Returns NULL on
- * failure. Returns a 'drv_struct *' on success, NULL on error.
+ * Upon success the pointer must be freed via fwctl_put(). Returns a
+ * 'drv_struct \*' on success, NULL on error.
*/
#define fwctl_alloc_device(parent, ops, drv_struct, member) \
container_of(_fwctl_alloc_device( \
Thanks.
--
An old man doll... just what I always wanted! - Clara
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply related [flat|nested] 48+ messages in thread
* Re: [PATCH v2 1/8] fwctl: Add basic structure for a class subsystem with a cdev
2024-06-25 4:47 ` Bagas Sanjaya
@ 2024-07-22 16:04 ` Jason Gunthorpe
0 siblings, 0 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2024-07-22 16:04 UTC (permalink / raw)
To: Bagas Sanjaya
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Tue, Jun 25, 2024 at 11:47:11AM +0700, Bagas Sanjaya wrote:
> On Mon, Jun 24, 2024 at 07:47:25PM -0300, Jason Gunthorpe wrote:
> > +/**
> > + * fwctl_alloc_device - Allocate a fwctl
> > + * @parent: Physical device that provides the FW interface
> > + * @ops: Driver ops to register
> > + * @drv_struct: 'struct driver_fwctl' that holds the struct fwctl_device
> > + * @member: Name of the struct fwctl_device in @drv_struct
> > + *
> > + * This allocates and initializes the fwctl_device embedded in the drv_struct.
> > + * Upon success the pointer must be freed via fwctl_put(). Returns NULL on
> > + * failure. Returns a 'drv_struct *' on success, NULL on error.
> > + */
>
> Sphinx reports htmldocs warning:
>
> Documentation/userspace-api/fwctl:195: ./include/linux/fwctl.h:72: WARNING: Inline emphasis start-string without end-string.
>
> I have to escape the pointer (while also cleaning up redundant wordings on
> error case):
Got it thanks
Jason
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 1/8] fwctl: Add basic structure for a class subsystem with a cdev
2024-06-24 22:47 ` [PATCH v2 1/8] fwctl: Add basic structure for a class subsystem with a cdev Jason Gunthorpe
2024-06-25 4:47 ` Bagas Sanjaya
@ 2024-07-26 14:30 ` Jonathan Cameron
2024-07-29 17:30 ` Jason Gunthorpe
1 sibling, 1 reply; 48+ messages in thread
From: Jonathan Cameron @ 2024-07-26 14:30 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Mon, 24 Jun 2024 19:47:25 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:
> Create the class, character device and functions for a fwctl driver to
> un/register to the subsystem.
>
> A typical fwctl driver has a sysfs presence like:
>
> $ ls -l /dev/fwctl/fwctl0
> crw------- 1 root root 250, 0 Apr 25 19:16 /dev/fwctl/fwctl0
>
> $ ls /sys/class/fwctl/fwctl0
> dev device power subsystem uevent
>
> $ ls /sys/class/fwctl/fwctl0/device/infiniband/
> ibp0s10f0
>
> $ ls /sys/class/infiniband/ibp0s10f0/device/fwctl/
> fwctl0/
>
> $ ls /sys/devices/pci0000:00/0000:00:0a.0/fwctl/fwctl0
> dev device power subsystem uevent
>
> Which allows userspace to link all the multi-subsystem driver components
> together and learn the subsystem specific names for the device's
> components.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Hi Jason,
Mostly looking at this to get my head around what the details are,
but whilst I'm reading might as well offer some review comments.
I'm not a fan of too many mini patches as it makes it harder
to review rather than easier, but meh, I know others prefer
it this way. If you are going to do it though, comments
need to be carefully tracking what they are talking about.
Jonathan
...
> diff --git a/drivers/fwctl/main.c b/drivers/fwctl/main.c
> new file mode 100644
> index 00000000000000..6e9bf15c743b5c
> --- /dev/null
> +++ b/drivers/fwctl/main.c
> @@ -0,0 +1,177 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES
> + */
> +#define pr_fmt(fmt) "fwctl: " fmt
> +#include <linux/fwctl.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/container_of.h>
> +#include <linux/fs.h>
Trivial: Pick an ordering scheme perhaps as then we know where you'd
like new headers to be added.
> +
> +enum {
> + FWCTL_MAX_DEVICES = 256,
> +};
> +static dev_t fwctl_dev;
> +static DEFINE_IDA(fwctl_ida);
> +static struct fwctl_device *
> +_alloc_device(struct device *parent, const struct fwctl_ops *ops, size_t size)
> +{
> + struct fwctl_device *fwctl __free(kfree) = kzalloc(size, GFP_KERNEL);
> + int devnum;
> +
> + if (!fwctl)
> + return NULL;
I'd put a blank line here.
> + fwctl->dev.class = &fwctl_class;
> + fwctl->dev.parent = parent;
> +
> + devnum = ida_alloc_max(&fwctl_ida, FWCTL_MAX_DEVICES - 1, GFP_KERNEL);
> + if (devnum < 0)
> + return NULL;
> + fwctl->dev.devt = fwctl_dev + devnum;
> +
> + device_initialize(&fwctl->dev);
> + return_ptr(fwctl);
> +}
> +
> +/* Drivers use the fwctl_alloc_device() wrapper */
> +struct fwctl_device *_fwctl_alloc_device(struct device *parent,
> + const struct fwctl_ops *ops,
> + size_t size)
> +{
> + struct fwctl_device *fwctl __free(fwctl) =
> + _alloc_device(parent, ops, size);
> +
> + if (!fwctl)
> + return NULL;
> +
> + cdev_init(&fwctl->cdev, &fwctl_fops);
> + fwctl->cdev.owner = THIS_MODULE;
Owned by fwctl core, not the parent driver? Perhaps a comment on why.
I guess related to the lifetime being independent of parent driver.
> +
> + if (dev_set_name(&fwctl->dev, "fwctl%d", fwctl->dev.devt - fwctl_dev))
> + return NULL;
> +
> + fwctl->ops = ops;
> + return_ptr(fwctl);
> +}
> +EXPORT_SYMBOL_NS_GPL(_fwctl_alloc_device, FWCTL);
> +
> +/**
> + * fwctl_register - Register a new device to the subsystem
> + * @fwctl: Previously allocated fwctl_device
> + *
> + * On return the device is visible through sysfs and /dev, driver ops may be
> + * called.
> + */
> +int fwctl_register(struct fwctl_device *fwctl)
> +{
> + int ret;
> +
> + ret = cdev_device_add(&fwctl->cdev, &fwctl->dev);
> + if (ret)
> + return ret;
> + return 0;
Doesn't look like this ever gets more complex so
return cdev_device_add(...)
If you expect to see more here in near future maybe fair enough
to keep the handling as is.
> +}
> +EXPORT_SYMBOL_NS_GPL(fwctl_register, FWCTL);
> +
> +/**
> + * fwctl_unregister - Unregister a device from the subsystem
> + * @fwctl: Previously allocated and registered fwctl_device
> + *
> + * Undoes fwctl_register(). On return no driver ops will be called. The
> + * caller must still call fwctl_put() to free the fwctl.
> + *
> + * Unregister will return even if userspace still has file descriptors open.
> + * This will call ops->close_uctx() on any open FDs and after return no driver
> + * op will be called. The FDs remain open but all fops will return -ENODEV.
Perhaps bring the docs in with the support? I got (briefly) confused
by the lack of a path to close_uctx() in here.
> + *
> + * The design of fwctl allows this sort of disassociation of the driver from the
> + * subsystem primarily by keeping memory allocations owned by the core subsytem.
> + * The fwctl_device and fwctl_uctx can both be freed without requiring a driver
> + * callback. This allows the module to remain unlocked while FDs are open.
> + */
> +void fwctl_unregister(struct fwctl_device *fwctl)
> +{
> + cdev_device_del(&fwctl->cdev, &fwctl->dev);
> +
> + /*
> + * The driver module may unload after this returns, the op pointer will
> + * not be valid.
> + */
> + fwctl->ops = NULL;
I'd bring that in with the logic doing close_uctx() etc as then it will align
with the comments that I'd also suggest only adding there (patch 2 I think).
> +}
> +EXPORT_SYMBOL_NS_GPL(fwctl_unregister, FWCTL);
> diff --git a/include/linux/fwctl.h b/include/linux/fwctl.h
> new file mode 100644
> index 00000000000000..ef4eaa87c945e4
> --- /dev/null
> +++ b/include/linux/fwctl.h
> @@ -0,0 +1,68 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES
> + */
> +#ifndef __LINUX_FWCTL_H
> +#define __LINUX_FWCTL_H
> +#include <linux/device.h>
> +#include <linux/cdev.h>
> +#include <linux/cleanup.h>
> +
> +struct fwctl_device;
> +struct fwctl_uctx;
> +
> +struct fwctl_ops {
> +};
> +
> +/**
> + * struct fwctl_device - Per-driver registration struct
> + * @dev: The sysfs (class/fwctl/fwctlXX) device
> + *
> + * Each driver instance will have one of these structs with the driver
> + * private data following immeidately after. This struct is refcounted,
immediately
> + * it is freed by calling fwctl_put().
> + */
> +struct fwctl_device {
> + struct device dev;
> + /* private: */
> + struct cdev cdev;
> + const struct fwctl_ops *ops;
> +};
> +
> +struct fwctl_device *_fwctl_alloc_device(struct device *parent,
> + const struct fwctl_ops *ops,
> + size_t size);
> +/**
> + * fwctl_alloc_device - Allocate a fwctl
> + * @parent: Physical device that provides the FW interface
> + * @ops: Driver ops to register
> + * @drv_struct: 'struct driver_fwctl' that holds the struct fwctl_device
> + * @member: Name of the struct fwctl_device in @drv_struct
> + *
> + * This allocates and initializes the fwctl_device embedded in the drv_struct.
> + * Upon success the pointer must be freed via fwctl_put(). Returns NULL on
> + * failure. Returns a 'drv_struct *' on success, NULL on error.
> + */
> +#define fwctl_alloc_device(parent, ops, drv_struct, member) \
> + container_of(_fwctl_alloc_device( \
> + parent, ops, \
> + sizeof(drv_struct) + \
> + BUILD_BUG_ON_ZERO( \
> + offsetof(drv_struct, member))), \
Doesn't that fire a build_bug when the member is at the start of drv_struct?
Or do I have that backwards?
Does container_of() safely handle a NULL?
I'm staring at the definition and can't spot code to do that in 6.10
> + drv_struct, member)
> +
^ permalink raw reply [flat|nested] 48+ messages in thread* Re: [PATCH v2 1/8] fwctl: Add basic structure for a class subsystem with a cdev
2024-07-26 14:30 ` Jonathan Cameron
@ 2024-07-29 17:30 ` Jason Gunthorpe
2024-07-30 17:15 ` Jonathan Cameron
0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2024-07-29 17:30 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Fri, Jul 26, 2024 at 03:30:42PM +0100, Jonathan Cameron wrote:
> Mostly looking at this to get my head around what the details are,
> but whilst I'm reading might as well offer some review comments.
Thanks!
> I'm not a fan of too many mini patches as it makes it harder
> to review rather than easier, but meh, I know others prefer
> it this way. If you are going to do it though, comments
> need to be carefully tracking what they are talking about.
Yeah, I don't like it so much either, but given the debate on this
series I structured it so you can read the commit messages only and
have a pretty good idea what is inside.
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/*
> > + * Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES
> > + */
> > +#define pr_fmt(fmt) "fwctl: " fmt
> > +#include <linux/fwctl.h>
> > +#include <linux/module.h>
> > +#include <linux/slab.h>
> > +#include <linux/container_of.h>
> > +#include <linux/fs.h>
>
> Trivial: Pick an ordering scheme perhaps as then we know where you'd
> like new headers to be added.
Heh, I think it is random ordered :) But sure lets sort by name,
though linux/fwctl.h does go first. Putting headers first in at least
one c file is a neat trick to ensure they self-compile and don't miss
their own #includess
#define pr_fmt(fmt) "fwctl: " fmt
#include <linux/fwctl.h>
#include <linux/container_of.h>
#include <linux/fs.h>
#include <linux/module.h>
#include <linux/slab.h>
> > +static struct fwctl_device *
> > +_alloc_device(struct device *parent, const struct fwctl_ops *ops, size_t size)
> > +{
> > + struct fwctl_device *fwctl __free(kfree) = kzalloc(size, GFP_KERNEL);
> > + int devnum;
> > +
> > + if (!fwctl)
> > + return NULL;
>
> I'd put a blank line here.
Done
> > +/* Drivers use the fwctl_alloc_device() wrapper */
> > +struct fwctl_device *_fwctl_alloc_device(struct device *parent,
> > + const struct fwctl_ops *ops,
> > + size_t size)
> > +{
> > + struct fwctl_device *fwctl __free(fwctl) =
> > + _alloc_device(parent, ops, size);
> > +
> > + if (!fwctl)
> > + return NULL;
> > +
> > + cdev_init(&fwctl->cdev, &fwctl_fops);
> > + fwctl->cdev.owner = THIS_MODULE;
>
> Owned by fwctl core, not the parent driver? Perhaps a comment on why.
> I guess related to the lifetime being independent of parent driver.
Yes.
/*
* The driver module is protected by fwctl_register/unregister(),
* unregister won't complete until we are done with the driver's module.
*/
fwctl->cdev.owner = THIS_MODULE;
> > +int fwctl_register(struct fwctl_device *fwctl)
> > +{
> > + int ret;
> > +
> > + ret = cdev_device_add(&fwctl->cdev, &fwctl->dev);
> > + if (ret)
> > + return ret;
> > + return 0;
>
> Doesn't look like this ever gets more complex so
>
> return cdev_device_add(...)
>
> If you expect to see more here in near future maybe fair enough
> to keep the handling as is.
Sure, I was expecting more when I wrote it then it turned out there
wasn't
> > + * fwctl_unregister - Unregister a device from the subsystem
> > + * @fwctl: Previously allocated and registered fwctl_device
> > + *
> > + * Undoes fwctl_register(). On return no driver ops will be called. The
> > + * caller must still call fwctl_put() to free the fwctl.
> > + *
> > + * Unregister will return even if userspace still has file descriptors open.
> > + * This will call ops->close_uctx() on any open FDs and after return no driver
> > + * op will be called. The FDs remain open but all fops will return -ENODEV.
>
> Perhaps bring the docs in with the support? I got (briefly) confused
> by the lack of a path to close_uctx() in here.
Okay, that paragraph can be shifted
> > + *
> > + * The design of fwctl allows this sort of disassociation of the driver from the
> > + * subsystem primarily by keeping memory allocations owned by the core subsytem.
> > + * The fwctl_device and fwctl_uctx can both be freed without requiring a driver
> > + * callback. This allows the module to remain unlocked while FDs are open.
> > + */
And this explains the above a 2nd way
> > +void fwctl_unregister(struct fwctl_device *fwctl)
> > +{
> > + cdev_device_del(&fwctl->cdev, &fwctl->dev);
> > +
> > + /*
> > + * The driver module may unload after this returns, the op pointer will
> > + * not be valid.
> > + */
> > + fwctl->ops = NULL;
> I'd bring that in with the logic doing close_uctx() etc as then it will align
> with the comments that I'd also suggest only adding there (patch 2 I think).
Ok
> > +/**
> > + * fwctl_alloc_device - Allocate a fwctl
> > + * @parent: Physical device that provides the FW interface
> > + * @ops: Driver ops to register
> > + * @drv_struct: 'struct driver_fwctl' that holds the struct fwctl_device
> > + * @member: Name of the struct fwctl_device in @drv_struct
> > + *
> > + * This allocates and initializes the fwctl_device embedded in the drv_struct.
> > + * Upon success the pointer must be freed via fwctl_put(). Returns NULL on
> > + * failure. Returns a 'drv_struct *' on success, NULL on error.
> > + */
> > +#define fwctl_alloc_device(parent, ops, drv_struct, member) \
> > + container_of(_fwctl_alloc_device( \
> > + parent, ops, \
> > + sizeof(drv_struct) + \
> > + BUILD_BUG_ON_ZERO( \
> > + offsetof(drv_struct, member))), \
> Doesn't that fire a build_bug when the member is at the start of drv_struct?
> Or do I have that backwards?
BUILD_BUG_ON(true) == failure, evaluates to void
BUILD_BUG_ON_ZERO(true) == fails, evaluates to 0
BUILD_BUG_ON_ZERO(false) == false, evaluates to 0
It is a bit confusing name, it is not ON_ZERO it is BUG_ON return ZERO
> Does container_of() safely handle a NULL?
Generally no, nor does it handle ERR_PTR, but it does work for both if
the offset is 0.
The BUILD_BUG guarentees the 0 offset both so that the casting inside
_fwctl_alloc_device() works and we can use safely use container_of()
to enforce the type check.
What do you think about writing it like this instead:
#define fwctl_alloc_device(parent, ops, drv_struct, member) \
({ \
static_assert(__same_type(struct fwctl_device, \
((drv_struct *)NULL)->member)); \
static_assert(offsetof(drv_struct, member) == 0); \
(drv_struct *)_fwctl_alloc_device(parent, ops, \
sizeof(drv_struct)); \
})
?
In some ways I like it better..
Thanks,
Jason
^ permalink raw reply [flat|nested] 48+ messages in thread* Re: [PATCH v2 1/8] fwctl: Add basic structure for a class subsystem with a cdev
2024-07-29 17:30 ` Jason Gunthorpe
@ 2024-07-30 17:15 ` Jonathan Cameron
0 siblings, 0 replies; 48+ messages in thread
From: Jonathan Cameron @ 2024-07-30 17:15 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
>
> > > +/**
> > > + * fwctl_alloc_device - Allocate a fwctl
> > > + * @parent: Physical device that provides the FW interface
> > > + * @ops: Driver ops to register
> > > + * @drv_struct: 'struct driver_fwctl' that holds the struct fwctl_device
> > > + * @member: Name of the struct fwctl_device in @drv_struct
> > > + *
> > > + * This allocates and initializes the fwctl_device embedded in the drv_struct.
> > > + * Upon success the pointer must be freed via fwctl_put(). Returns NULL on
> > > + * failure. Returns a 'drv_struct *' on success, NULL on error.
> > > + */
> > > +#define fwctl_alloc_device(parent, ops, drv_struct, member) \
> > > + container_of(_fwctl_alloc_device( \
> > > + parent, ops, \
> > > + sizeof(drv_struct) + \
> > > + BUILD_BUG_ON_ZERO( \
> > > + offsetof(drv_struct, member))), \
> > Doesn't that fire a build_bug when the member is at the start of drv_struct?
> > Or do I have that backwards?
>
> BUILD_BUG_ON(true) == failure, evaluates to void
> BUILD_BUG_ON_ZERO(true) == fails, evaluates to 0
> BUILD_BUG_ON_ZERO(false) == false, evaluates to 0
>
> It is a bit confusing name, it is not ON_ZERO it is BUG_ON return ZERO
Ah. That indeed got me. ouch.
>
> > Does container_of() safely handle a NULL?
>
> Generally no, nor does it handle ERR_PTR, but it does work for both if
> the offset is 0.
Ah. Good point, I'd neglected the zero offset meaning it is really
just a fancy pointer type cast.
>
> The BUILD_BUG guarentees the 0 offset both so that the casting inside
> _fwctl_alloc_device() works and we can use safely use container_of()
> to enforce the type check.
>
> What do you think about writing it like this instead:
>
> #define fwctl_alloc_device(parent, ops, drv_struct, member) \
> ({ \
> static_assert(__same_type(struct fwctl_device, \
> ((drv_struct *)NULL)->member)); \
> static_assert(offsetof(drv_struct, member) == 0); \
> (drv_struct *)_fwctl_alloc_device(parent, ops, \
> sizeof(drv_struct)); \
> })
>
> ?
>
> In some ways I like it better..
Seems more readable to me and avoids entertaining corners of the previous approach.
Jonathan
>
> Thanks,
> Jason
^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH v2 2/8] fwctl: Basic ioctl dispatch for the character device
2024-06-24 22:47 [PATCH v2 0/8] Introduce fwctl subystem Jason Gunthorpe
2024-06-24 22:47 ` [PATCH v2 1/8] fwctl: Add basic structure for a class subsystem with a cdev Jason Gunthorpe
@ 2024-06-24 22:47 ` Jason Gunthorpe
2024-07-26 15:01 ` Jonathan Cameron
2024-08-06 7:36 ` Daniel Vetter
2024-06-24 22:47 ` [PATCH v2 3/8] fwctl: FWCTL_INFO to return basic information about the device Jason Gunthorpe
` (6 subsequent siblings)
8 siblings, 2 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2024-06-24 22:47 UTC (permalink / raw)
To: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan
Cc: Andy Gospodarek, Aron Silverton, Dan Williams, David Ahern,
Christoph Hellwig, Jiri Pirko, Leonid Bloch, Leon Romanovsky,
linux-cxl, patches
Each file descriptor gets a chunk of per-FD driver specific context that
allows the driver to attach a device specific struct to. The core code
takes care of the memory lifetime for this structure.
The ioctl dispatch and design is based on what was built for iommufd. The
ioctls have a struct which has a combined in/out behavior with a typical
'zero pad' scheme for future extension and backwards compatibility.
Like iommufd some shared logic does most of the ioctl marshalling and
compatibility work and tables diatches to some function pointers for
each unique iotcl.
This approach has proven to work quite well in the iommufd and rdma
subsystems.
Allocate an ioctl number space for the subsystem.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
.../userspace-api/ioctl/ioctl-number.rst | 1 +
MAINTAINERS | 1 +
drivers/fwctl/main.c | 124 +++++++++++++++++-
include/linux/fwctl.h | 31 +++++
include/uapi/fwctl/fwctl.h | 41 ++++++
5 files changed, 196 insertions(+), 2 deletions(-)
create mode 100644 include/uapi/fwctl/fwctl.h
diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
index a141e8e65c5d3a..4d91c5a20b98c8 100644
--- a/Documentation/userspace-api/ioctl/ioctl-number.rst
+++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
@@ -324,6 +324,7 @@ Code Seq# Include File Comments
0x97 00-7F fs/ceph/ioctl.h Ceph file system
0x99 00-0F 537-Addinboard driver
<mailto:buk@buks.ipn.de>
+0x9A 00-0F include/uapi/fwctl/fwctl.h
0xA0 all linux/sdp/sdp.h Industrial Device Project
<mailto:kenji@bitgate.com>
0xA1 0 linux/vtpm_proxy.h TPM Emulator Proxy Driver
diff --git a/MAINTAINERS b/MAINTAINERS
index aa7a760d12f8ef..2090009a6ae98a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9083,6 +9083,7 @@ S: Maintained
F: Documentation/userspace-api/fwctl.rst
F: drivers/fwctl/
F: include/linux/fwctl.h
+F: include/uapi/fwctl/
GALAXYCORE GC0308 CAMERA SENSOR DRIVER
M: Sebastian Reichel <sre@kernel.org>
diff --git a/drivers/fwctl/main.c b/drivers/fwctl/main.c
index 6e9bf15c743b5c..6872c01d5c62e8 100644
--- a/drivers/fwctl/main.c
+++ b/drivers/fwctl/main.c
@@ -9,26 +9,131 @@
#include <linux/container_of.h>
#include <linux/fs.h>
+#include <uapi/fwctl/fwctl.h>
+
enum {
FWCTL_MAX_DEVICES = 256,
};
static dev_t fwctl_dev;
static DEFINE_IDA(fwctl_ida);
+struct fwctl_ucmd {
+ struct fwctl_uctx *uctx;
+ void __user *ubuffer;
+ void *cmd;
+ u32 user_size;
+};
+
+/* On stack memory for the ioctl structs */
+union ucmd_buffer {
+};
+
+struct fwctl_ioctl_op {
+ unsigned int size;
+ unsigned int min_size;
+ unsigned int ioctl_num;
+ int (*execute)(struct fwctl_ucmd *ucmd);
+};
+
+#define IOCTL_OP(_ioctl, _fn, _struct, _last) \
+ [_IOC_NR(_ioctl) - FWCTL_CMD_BASE] = { \
+ .size = sizeof(_struct) + \
+ BUILD_BUG_ON_ZERO(sizeof(union ucmd_buffer) < \
+ sizeof(_struct)), \
+ .min_size = offsetofend(_struct, _last), \
+ .ioctl_num = _ioctl, \
+ .execute = _fn, \
+ }
+static const struct fwctl_ioctl_op fwctl_ioctl_ops[] = {
+};
+
+static long fwctl_fops_ioctl(struct file *filp, unsigned int cmd,
+ unsigned long arg)
+{
+ struct fwctl_uctx *uctx = filp->private_data;
+ const struct fwctl_ioctl_op *op;
+ struct fwctl_ucmd ucmd = {};
+ union ucmd_buffer buf;
+ unsigned int nr;
+ int ret;
+
+ nr = _IOC_NR(cmd);
+ if ((nr - FWCTL_CMD_BASE) >= ARRAY_SIZE(fwctl_ioctl_ops))
+ return -ENOIOCTLCMD;
+ op = &fwctl_ioctl_ops[nr - FWCTL_CMD_BASE];
+ if (op->ioctl_num != cmd)
+ return -ENOIOCTLCMD;
+
+ ucmd.uctx = uctx;
+ ucmd.cmd = &buf;
+ ucmd.ubuffer = (void __user *)arg;
+ ret = get_user(ucmd.user_size, (u32 __user *)ucmd.ubuffer);
+ if (ret)
+ return ret;
+
+ if (ucmd.user_size < op->min_size)
+ return -EINVAL;
+
+ ret = copy_struct_from_user(ucmd.cmd, op->size, ucmd.ubuffer,
+ ucmd.user_size);
+ if (ret)
+ return ret;
+
+ guard(rwsem_read)(&uctx->fwctl->registration_lock);
+ if (!uctx->fwctl->ops)
+ return -ENODEV;
+ return op->execute(&ucmd);
+}
+
static int fwctl_fops_open(struct inode *inode, struct file *filp)
{
struct fwctl_device *fwctl =
container_of(inode->i_cdev, struct fwctl_device, cdev);
+ int ret;
+
+ guard(rwsem_read)(&fwctl->registration_lock);
+ if (!fwctl->ops)
+ return -ENODEV;
+
+ struct fwctl_uctx *uctx __free(kfree) =
+ kzalloc(fwctl->ops->uctx_size, GFP_KERNEL | GFP_KERNEL_ACCOUNT);
+ if (!uctx)
+ return -ENOMEM;
+
+ uctx->fwctl = fwctl;
+ ret = fwctl->ops->open_uctx(uctx);
+ if (ret)
+ return ret;
+
+ scoped_guard(mutex, &fwctl->uctx_list_lock) {
+ list_add_tail(&uctx->uctx_list_entry, &fwctl->uctx_list);
+ }
get_device(&fwctl->dev);
- filp->private_data = fwctl;
+ filp->private_data = no_free_ptr(uctx);
return 0;
}
+static void fwctl_destroy_uctx(struct fwctl_uctx *uctx)
+{
+ lockdep_assert_held(&uctx->fwctl->uctx_list_lock);
+ list_del(&uctx->uctx_list_entry);
+ uctx->fwctl->ops->close_uctx(uctx);
+}
+
static int fwctl_fops_release(struct inode *inode, struct file *filp)
{
- struct fwctl_device *fwctl = filp->private_data;
+ struct fwctl_uctx *uctx = filp->private_data;
+ struct fwctl_device *fwctl = uctx->fwctl;
+ scoped_guard(rwsem_read, &fwctl->registration_lock) {
+ if (fwctl->ops) {
+ guard(mutex)(&fwctl->uctx_list_lock);
+ fwctl_destroy_uctx(uctx);
+ }
+ }
+
+ kfree(uctx);
fwctl_put(fwctl);
return 0;
}
@@ -37,6 +142,7 @@ static const struct file_operations fwctl_fops = {
.owner = THIS_MODULE,
.open = fwctl_fops_open,
.release = fwctl_fops_release,
+ .unlocked_ioctl = fwctl_fops_ioctl,
};
static void fwctl_device_release(struct device *device)
@@ -45,6 +151,7 @@ static void fwctl_device_release(struct device *device)
container_of(device, struct fwctl_device, dev);
ida_free(&fwctl_ida, fwctl->dev.devt - fwctl_dev);
+ mutex_destroy(&fwctl->uctx_list_lock);
kfree(fwctl);
}
@@ -69,6 +176,9 @@ _alloc_device(struct device *parent, const struct fwctl_ops *ops, size_t size)
return NULL;
fwctl->dev.class = &fwctl_class;
fwctl->dev.parent = parent;
+ init_rwsem(&fwctl->registration_lock);
+ mutex_init(&fwctl->uctx_list_lock);
+ INIT_LIST_HEAD(&fwctl->uctx_list);
devnum = ida_alloc_max(&fwctl_ida, FWCTL_MAX_DEVICES - 1, GFP_KERNEL);
if (devnum < 0)
@@ -137,8 +247,18 @@ EXPORT_SYMBOL_NS_GPL(fwctl_register, FWCTL);
*/
void fwctl_unregister(struct fwctl_device *fwctl)
{
+ struct fwctl_uctx *uctx;
+
cdev_device_del(&fwctl->cdev, &fwctl->dev);
+ /* Disable and free the driver's resources for any still open FDs. */
+ guard(rwsem_write)(&fwctl->registration_lock);
+ guard(mutex)(&fwctl->uctx_list_lock);
+ while ((uctx = list_first_entry_or_null(&fwctl->uctx_list,
+ struct fwctl_uctx,
+ uctx_list_entry)))
+ fwctl_destroy_uctx(uctx);
+
/*
* The driver module may unload after this returns, the op pointer will
* not be valid.
diff --git a/include/linux/fwctl.h b/include/linux/fwctl.h
index ef4eaa87c945e4..1d9651de92fc19 100644
--- a/include/linux/fwctl.h
+++ b/include/linux/fwctl.h
@@ -11,7 +11,20 @@
struct fwctl_device;
struct fwctl_uctx;
+/**
+ * struct fwctl_ops - Driver provided operations
+ * @uctx_size: The size of the fwctl_uctx struct to allocate. The first
+ * bytes of this memory will be a fwctl_uctx. The driver can use the
+ * remaining bytes as its private memory.
+ * @open_uctx: Called when a file descriptor is opened before the uctx is ever
+ * used.
+ * @close_uctx: Called when the uctx is destroyed, usually when the FD is
+ * closed.
+ */
struct fwctl_ops {
+ size_t uctx_size;
+ int (*open_uctx)(struct fwctl_uctx *uctx);
+ void (*close_uctx)(struct fwctl_uctx *uctx);
};
/**
@@ -26,6 +39,10 @@ struct fwctl_device {
struct device dev;
/* private: */
struct cdev cdev;
+
+ struct rw_semaphore registration_lock;
+ struct mutex uctx_list_lock;
+ struct list_head uctx_list;
const struct fwctl_ops *ops;
};
@@ -65,4 +82,18 @@ DEFINE_FREE(fwctl, struct fwctl_device *, if (_T) fwctl_put(_T));
int fwctl_register(struct fwctl_device *fwctl);
void fwctl_unregister(struct fwctl_device *fwctl);
+/**
+ * struct fwctl_uctx - Per user FD context
+ * @fwctl: fwctl instance that owns the context
+ *
+ * Every FD opened by userspace will get a unique context allocation. Any driver
+ * private data will follow immediately after.
+ */
+struct fwctl_uctx {
+ struct fwctl_device *fwctl;
+ /* private: */
+ /* Head at fwctl_device::uctx_list */
+ struct list_head uctx_list_entry;
+};
+
#endif
diff --git a/include/uapi/fwctl/fwctl.h b/include/uapi/fwctl/fwctl.h
new file mode 100644
index 00000000000000..0bdce95b6d69d9
--- /dev/null
+++ b/include/uapi/fwctl/fwctl.h
@@ -0,0 +1,41 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/* Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.
+ */
+#ifndef _UAPI_FWCTL_H
+#define _UAPI_FWCTL_H
+
+#include <linux/types.h>
+#include <linux/ioctl.h>
+
+#define FWCTL_TYPE 0x9A
+
+/**
+ * DOC: General ioctl format
+ *
+ * The ioctl interface follows a general format to allow for extensibility. Each
+ * ioctl is passed in a structure pointer as the argument providing the size of
+ * the structure in the first u32. The kernel checks that any structure space
+ * beyond what it understands is 0. This allows userspace to use the backward
+ * compatible portion while consistently using the newer, larger, structures.
+ *
+ * ioctls use a standard meaning for common errnos:
+ *
+ * - ENOTTY: The IOCTL number itself is not supported at all
+ * - E2BIG: The IOCTL number is supported, but the provided structure has
+ * non-zero in a part the kernel does not understand.
+ * - EOPNOTSUPP: The IOCTL number is supported, and the structure is
+ * understood, however a known field has a value the kernel does not
+ * understand or support.
+ * - EINVAL: Everything about the IOCTL was understood, but a field is not
+ * correct.
+ * - ENOMEM: Out of memory.
+ * - ENODEV: The underlying device has been hot-unplugged and the FD is
+ * orphaned.
+ *
+ * As well as additional errnos, within specific ioctls.
+ */
+enum {
+ FWCTL_CMD_BASE = 0,
+};
+
+#endif
--
2.45.2
^ permalink raw reply related [flat|nested] 48+ messages in thread* Re: [PATCH v2 2/8] fwctl: Basic ioctl dispatch for the character device
2024-06-24 22:47 ` [PATCH v2 2/8] fwctl: Basic ioctl dispatch for the character device Jason Gunthorpe
@ 2024-07-26 15:01 ` Jonathan Cameron
2024-07-29 17:05 ` Jason Gunthorpe
2024-08-06 7:36 ` Daniel Vetter
1 sibling, 1 reply; 48+ messages in thread
From: Jonathan Cameron @ 2024-07-26 15:01 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Mon, 24 Jun 2024 19:47:26 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:
> Each file descriptor gets a chunk of per-FD driver specific context that
> allows the driver to attach a device specific struct to. The core code
> takes care of the memory lifetime for this structure.
>
> The ioctl dispatch and design is based on what was built for iommufd. The
> ioctls have a struct which has a combined in/out behavior with a typical
> 'zero pad' scheme for future extension and backwards compatibility.
>
> Like iommufd some shared logic does most of the ioctl marshalling and
> compatibility work and tables diatches to some function pointers for
> each unique iotcl.
>
> This approach has proven to work quite well in the iommufd and rdma
> subsystems.
>
> Allocate an ioctl number space for the subsystem.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Minor stuff inline.
> M: Sebastian Reichel <sre@kernel.org>
> diff --git a/drivers/fwctl/main.c b/drivers/fwctl/main.c
> index 6e9bf15c743b5c..6872c01d5c62e8 100644
> --- a/drivers/fwctl/main.c
> +++ b/drivers/fwctl/main.c
> +
> +struct fwctl_ioctl_op {
> + unsigned int size;
> + unsigned int min_size;
> + unsigned int ioctl_num;
> + int (*execute)(struct fwctl_ucmd *ucmd);
> +};
> +
> +#define IOCTL_OP(_ioctl, _fn, _struct, _last) \
> + [_IOC_NR(_ioctl) - FWCTL_CMD_BASE] = { \
If this is always zero indexed, maybe just drop the - FWCTL_CMD_BASE here
and elsewhere? Maybe through in a BUILD_BUG to confirm it is always 0.
> + .size = sizeof(_struct) + \
> + BUILD_BUG_ON_ZERO(sizeof(union ucmd_buffer) < \
> + sizeof(_struct)), \
> + .min_size = offsetofend(_struct, _last), \
> + .ioctl_num = _ioctl, \
> + .execute = _fn, \
> + }
> +static const struct fwctl_ioctl_op fwctl_ioctl_ops[] = {
> +};
> +
> +static long fwctl_fops_ioctl(struct file *filp, unsigned int cmd,
> + unsigned long arg)
> +{
> + struct fwctl_uctx *uctx = filp->private_data;
> + const struct fwctl_ioctl_op *op;
> + struct fwctl_ucmd ucmd = {};
> + union ucmd_buffer buf;
> + unsigned int nr;
> + int ret;
> +
> + nr = _IOC_NR(cmd);
> + if ((nr - FWCTL_CMD_BASE) >= ARRAY_SIZE(fwctl_ioctl_ops))
> + return -ENOIOCTLCMD;
I'd add a blank line here as two unconnected set and error check
blocks.
> + op = &fwctl_ioctl_ops[nr - FWCTL_CMD_BASE];
> + if (op->ioctl_num != cmd)
> + return -ENOIOCTLCMD;
> +
> + ucmd.uctx = uctx;
> + ucmd.cmd = &buf;
> + ucmd.ubuffer = (void __user *)arg;
> + ret = get_user(ucmd.user_size, (u32 __user *)ucmd.ubuffer);
> + if (ret)
> + return ret;
> +
> + if (ucmd.user_size < op->min_size)
> + return -EINVAL;
> +
> + ret = copy_struct_from_user(ucmd.cmd, op->size, ucmd.ubuffer,
> + ucmd.user_size);
> + if (ret)
> + return ret;
> +
> + guard(rwsem_read)(&uctx->fwctl->registration_lock);
> + if (!uctx->fwctl->ops)
> + return -ENODEV;
> + return op->execute(&ucmd);
> +}
> +
> static int fwctl_fops_open(struct inode *inode, struct file *filp)
> {
> struct fwctl_device *fwctl =
> container_of(inode->i_cdev, struct fwctl_device, cdev);
> + int ret;
> +
> + guard(rwsem_read)(&fwctl->registration_lock);
> + if (!fwctl->ops)
> + return -ENODEV;
> +
> + struct fwctl_uctx *uctx __free(kfree) =
> + kzalloc(fwctl->ops->uctx_size, GFP_KERNEL | GFP_KERNEL_ACCOUNT);
GFP_KERNEL_ACCOUNT seems to include GFP_KERNEL already.
Did I miss some racing change?
> + if (!uctx)
> + return -ENOMEM;
> +
> + uctx->fwctl = fwctl;
> + ret = fwctl->ops->open_uctx(uctx);
> + if (ret)
> + return ret;
> +
> + scoped_guard(mutex, &fwctl->uctx_list_lock) {
> + list_add_tail(&uctx->uctx_list_entry, &fwctl->uctx_list);
> + }
I guess more may come later but do we need {}?
>
> get_device(&fwctl->dev);
> - filp->private_data = fwctl;
> + filp->private_data = no_free_ptr(uctx);
> return 0;
> }
>
> +static void fwctl_destroy_uctx(struct fwctl_uctx *uctx)
> +{
> + lockdep_assert_held(&uctx->fwctl->uctx_list_lock);
> + list_del(&uctx->uctx_list_entry);
> + uctx->fwctl->ops->close_uctx(uctx);
> +}
> +
> static int fwctl_fops_release(struct inode *inode, struct file *filp)
> {
> - struct fwctl_device *fwctl = filp->private_data;
> + struct fwctl_uctx *uctx = filp->private_data;
> + struct fwctl_device *fwctl = uctx->fwctl;
>
> + scoped_guard(rwsem_read, &fwctl->registration_lock) {
> + if (fwctl->ops) {
Maybe a comment on when this path happens to help the reader
along. (when the file is closed and device is still alive).
Otherwise was cleaned up already in fwctl_unregister()
> + guard(mutex)(&fwctl->uctx_list_lock);
> + fwctl_destroy_uctx(uctx);
> + }
> + }
> +
> + kfree(uctx);
> fwctl_put(fwctl);
> return 0;
> }
> @@ -37,6 +142,7 @@ static const struct file_operations fwctl_fops = {
> .owner = THIS_MODULE,
> .open = fwctl_fops_open,
> .release = fwctl_fops_release,
> + .unlocked_ioctl = fwctl_fops_ioctl,
> };
>
> devnum = ida_alloc_max(&fwctl_ida, FWCTL_MAX_DEVICES - 1, GFP_KERNEL);
> if (devnum < 0)
> @@ -137,8 +247,18 @@ EXPORT_SYMBOL_NS_GPL(fwctl_register, FWCTL);
> */
> void fwctl_unregister(struct fwctl_device *fwctl)
> {
> + struct fwctl_uctx *uctx;
> +
> cdev_device_del(&fwctl->cdev, &fwctl->dev);
>
> + /* Disable and free the driver's resources for any still open FDs. */
> + guard(rwsem_write)(&fwctl->registration_lock);
> + guard(mutex)(&fwctl->uctx_list_lock);
> + while ((uctx = list_first_entry_or_null(&fwctl->uctx_list,
> + struct fwctl_uctx,
> + uctx_list_entry)))
> + fwctl_destroy_uctx(uctx);
> +
Obviously it's a little more heavy weight but I'd just use
list_for_each_entry_safe()
Less effort for reviewers than consider the custom iteration
you are doing instead.
> /*
> * The driver module may unload after this returns, the op pointer will
> * not be valid.
> diff --git a/include/linux/fwctl.h b/include/linux/fwctl.h
> index ef4eaa87c945e4..1d9651de92fc19 100644
> --- a/include/linux/fwctl.h
> +++ b/include/linux/fwctl.h
> @@ -11,7 +11,20 @@
> struct fwctl_device;
> struct fwctl_uctx;
>
> +/**
> + * struct fwctl_ops - Driver provided operations
> + * @uctx_size: The size of the fwctl_uctx struct to allocate. The first
> + * bytes of this memory will be a fwctl_uctx. The driver can use the
> + * remaining bytes as its private memory.
> + * @open_uctx: Called when a file descriptor is opened before the uctx is ever
> + * used.
> + * @close_uctx: Called when the uctx is destroyed, usually when the FD is
> + * closed.
> + */
> struct fwctl_ops {
> + size_t uctx_size;
> + int (*open_uctx)(struct fwctl_uctx *uctx);
> + void (*close_uctx)(struct fwctl_uctx *uctx);
> };
>
> /**
> @@ -26,6 +39,10 @@ struct fwctl_device {
> struct device dev;
> /* private: */
> struct cdev cdev;
> +
> + struct rw_semaphore registration_lock;
> + struct mutex uctx_list_lock;
Even for private locks, a scope statement would
be good to have.
> + struct list_head uctx_list;
> const struct fwctl_ops *ops;
> };
> #endif
> diff --git a/include/uapi/fwctl/fwctl.h b/include/uapi/fwctl/fwctl.h
> new file mode 100644
> index 00000000000000..0bdce95b6d69d9
> --- /dev/null
> +++ b/include/uapi/fwctl/fwctl.h
> @@ -0,0 +1,41 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/* Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.
> + */
> +#ifndef _UAPI_FWCTL_H
> +#define _UAPI_FWCTL_H
> +
> +#include <linux/types.h>
Not used yet.
> +#include <linux/ioctl.h>
Arguably nor is this, but at least this related to the code
here.
> +
> +#define FWCTL_TYPE 0x9A
> +
> +/**
> + * DOC: General ioctl format
> + *
> + * The ioctl interface follows a general format to allow for extensibility. Each
> + * ioctl is passed in a structure pointer as the argument providing the size of
> + * the structure in the first u32. The kernel checks that any structure space
> + * beyond what it understands is 0. This allows userspace to use the backward
> + * compatible portion while consistently using the newer, larger, structures.
Is that particularly helpful? Userspace needs to know not to put anything in
those fields, not hard for it to also know what the size it should send is?
The two will change together.
> + *
> + * ioctls use a standard meaning for common errnos:
> + *
> + * - ENOTTY: The IOCTL number itself is not supported at all
> + * - E2BIG: The IOCTL number is supported, but the provided structure has
> + * non-zero in a part the kernel does not understand.
> + * - EOPNOTSUPP: The IOCTL number is supported, and the structure is
> + * understood, however a known field has a value the kernel does not
> + * understand or support.
> + * - EINVAL: Everything about the IOCTL was understood, but a field is not
> + * correct.
> + * - ENOMEM: Out of memory.
> + * - ENODEV: The underlying device has been hot-unplugged and the FD is
> + * orphaned.
> + *
> + * As well as additional errnos, within specific ioctls.
> + */
> +enum {
> + FWCTL_CMD_BASE = 0,
> +};
> +
> +#endif
^ permalink raw reply [flat|nested] 48+ messages in thread* Re: [PATCH v2 2/8] fwctl: Basic ioctl dispatch for the character device
2024-07-26 15:01 ` Jonathan Cameron
@ 2024-07-29 17:05 ` Jason Gunthorpe
2024-07-30 17:28 ` Jonathan Cameron
0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2024-07-29 17:05 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Fri, Jul 26, 2024 at 04:01:57PM +0100, Jonathan Cameron wrote:
> > +struct fwctl_ioctl_op {
> > + unsigned int size;
> > + unsigned int min_size;
> > + unsigned int ioctl_num;
> > + int (*execute)(struct fwctl_ucmd *ucmd);
> > +};
> > +
> > +#define IOCTL_OP(_ioctl, _fn, _struct, _last) \
> > + [_IOC_NR(_ioctl) - FWCTL_CMD_BASE] = { \
>
> If this is always zero indexed, maybe just drop the - FWCTL_CMD_BASE here
> and elsewhere? Maybe through in a BUILD_BUG to confirm it is always 0.
I left it like this in case someone had different ideas for the number
space (iommufd uses a non 0 base also). I think either is fine, and I
slightly prefer keeping it rather than a static_assert.
> > +static long fwctl_fops_ioctl(struct file *filp, unsigned int cmd,
> > + unsigned long arg)
> > +{
> > + struct fwctl_uctx *uctx = filp->private_data;
> > + const struct fwctl_ioctl_op *op;
> > + struct fwctl_ucmd ucmd = {};
> > + union ucmd_buffer buf;
> > + unsigned int nr;
> > + int ret;
> > +
> > + nr = _IOC_NR(cmd);
> > + if ((nr - FWCTL_CMD_BASE) >= ARRAY_SIZE(fwctl_ioctl_ops))
> > + return -ENOIOCTLCMD;
>
> I'd add a blank line here as two unconnected set and error check
> blocks.
Done
> > static int fwctl_fops_open(struct inode *inode, struct file *filp)
> > {
> > struct fwctl_device *fwctl =
> > container_of(inode->i_cdev, struct fwctl_device, cdev);
> > + int ret;
> > +
> > + guard(rwsem_read)(&fwctl->registration_lock);
> > + if (!fwctl->ops)
> > + return -ENODEV;
> > +
> > + struct fwctl_uctx *uctx __free(kfree) =
> > + kzalloc(fwctl->ops->uctx_size, GFP_KERNEL | GFP_KERNEL_ACCOUNT);
>
> GFP_KERNEL_ACCOUNT seems to include GFP_KERNEL already.
> Did I miss some racing change?
I'm sure I copy and pasted this carelessly from someplace else
> > + if (!uctx)
> > + return -ENOMEM;
> > +
> > + uctx->fwctl = fwctl;
> > + ret = fwctl->ops->open_uctx(uctx);
> > + if (ret)
> > + return ret;
> > +
> > + scoped_guard(mutex, &fwctl->uctx_list_lock) {
> > + list_add_tail(&uctx->uctx_list_entry, &fwctl->uctx_list);
> > + }
>
> I guess more may come later but do we need {}?
I guessed the extra {} would be style guide for this construct?
> > static int fwctl_fops_release(struct inode *inode, struct file *filp)
> > {
> > - struct fwctl_device *fwctl = filp->private_data;
> > + struct fwctl_uctx *uctx = filp->private_data;
> > + struct fwctl_device *fwctl = uctx->fwctl;
> >
> > + scoped_guard(rwsem_read, &fwctl->registration_lock) {
> > + if (fwctl->ops) {
>
> Maybe a comment on when this path happens to help the reader
> along. (when the file is closed and device is still alive).
> Otherwise was cleaned up already in fwctl_unregister()
scoped_guard(rwsem_read, &fwctl->registration_lock) {
/*
* fwctl_unregister() has already removed the driver and
* destroyed the uctx.
*/
if (fwctl->ops) {
> > void fwctl_unregister(struct fwctl_device *fwctl)
> > {
> > + struct fwctl_uctx *uctx;
> > +
> > cdev_device_del(&fwctl->cdev, &fwctl->dev);
> >
> > + /* Disable and free the driver's resources for any still open FDs. */
> > + guard(rwsem_write)(&fwctl->registration_lock);
> > + guard(mutex)(&fwctl->uctx_list_lock);
> > + while ((uctx = list_first_entry_or_null(&fwctl->uctx_list,
> > + struct fwctl_uctx,
> > + uctx_list_entry)))
> > + fwctl_destroy_uctx(uctx);
> > +
>
> Obviously it's a little more heavy weight but I'd just use
> list_for_each_entry_safe()
>
> Less effort for reviewers than consider the custom iteration
> you are doing instead.
For these constructs the goal is the make the list empty, it is a
tinsy bit safer/clearer to drive the list to empty purposefully rather
than iterate over it and hope it is empty once done.
However there is no possible way that list_for_each_entry_safe() would
be an unsafe construct here. I can change it if you feel strongly
> > @@ -26,6 +39,10 @@ struct fwctl_device {
> > struct device dev;
> > /* private: */
> > struct cdev cdev;
> > +
> > + struct rw_semaphore registration_lock;
> > + struct mutex uctx_list_lock;
>
> Even for private locks, a scope statement would
> be good to have.
Like so?
/*
* Protect ops, held for write when ops becomes NULL during unregister,
* held for read whenver ops is loaded or an ops function is running.
*/
struct rw_semaphore registration_lock;
/* Protect uctx_list */
struct mutex uctx_list_lock;
> > +#ifndef _UAPI_FWCTL_H
> > +#define _UAPI_FWCTL_H
> > +
> > +#include <linux/types.h>
>
> Not used yet.
>
> > +#include <linux/ioctl.h>
>
> Arguably nor is this, but at least this related to the code
> here.
Sure, lets move them
> > +/**
> > + * DOC: General ioctl format
> > + *
> > + * The ioctl interface follows a general format to allow for extensibility. Each
> > + * ioctl is passed in a structure pointer as the argument providing the size of
> > + * the structure in the first u32. The kernel checks that any structure space
> > + * beyond what it understands is 0. This allows userspace to use the backward
> > + * compatible portion while consistently using the newer, larger, structures.
>
> Is that particularly helpful? Userspace needs to know not to put anything in
> those fields, not hard for it to also know what the size it should send is?
> The two will change together.
It is very helpful for a practical userspace.
Lets say we have an ioctl struct:
struct fwctl_info {
__u32 size;
__u32 flags;
__u32 out_device_type;
__u32 device_data_len;
__aligned_u64 out_device_data;
};
And the basic userspace pattern is:
struct fwctl_info info = {.size = sizeof(info), ...);
ioctl(fd, FWCTL_INFO, &info);
This works today and generates the 24 byte command.
Tomorrow the kernel adds a new member:
struct fwctl_info {
__u32 size;
__u32 flags;
__u32 out_device_type;
__u32 device_data_len;
__aligned_u64 out_device_data;
__aligned_u64 new_thing;
};
Current builds of the userpace use a 24 byte command. A new kernel
will see the 24 bytes and behave as before.
When I recompile the userspace with the updated header it will issue a
32 byte command with no source change.
Old kernel will see a 32 byte command with the trailing bytes it
doesn't understand as 0 and keep working.
The new kernel will see the new_thing bytes are zero and behave the
same as before.
If then the userspace decides to set new_thing the old kernel will
stop working. Userspace can use some 'try and fail' approach to try
again with new_thing = 0.
It gives a whole bunch of easy paths for userspace, otherwise
userspace has to be very careful to match the size of the struct to
the ABI it is targetting. Realistically nobody will do that right.
Thanks,
Jason
^ permalink raw reply [flat|nested] 48+ messages in thread* Re: [PATCH v2 2/8] fwctl: Basic ioctl dispatch for the character device
2024-07-29 17:05 ` Jason Gunthorpe
@ 2024-07-30 17:28 ` Jonathan Cameron
2024-08-01 13:05 ` Jason Gunthorpe
0 siblings, 1 reply; 48+ messages in thread
From: Jonathan Cameron @ 2024-07-30 17:28 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Mon, 29 Jul 2024 14:05:53 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:
> On Fri, Jul 26, 2024 at 04:01:57PM +0100, Jonathan Cameron wrote:
>
> > > +struct fwctl_ioctl_op {
> > > + unsigned int size;
> > > + unsigned int min_size;
> > > + unsigned int ioctl_num;
> > > + int (*execute)(struct fwctl_ucmd *ucmd);
> > > +};
> > > +
> > > +#define IOCTL_OP(_ioctl, _fn, _struct, _last) \
> > > + [_IOC_NR(_ioctl) - FWCTL_CMD_BASE] = { \
> >
> > If this is always zero indexed, maybe just drop the - FWCTL_CMD_BASE here
> > and elsewhere? Maybe through in a BUILD_BUG to confirm it is always 0.
>
> I left it like this in case someone had different ideas for the number
> space (iommufd uses a non 0 base also). I think either is fine, and I
> slightly prefer keeping it rather than a static_assert.
Ok. Feels a little messy to me. But fair enough I guess.
> > > + if (!uctx)
> > > + return -ENOMEM;
> > > +
> > > + uctx->fwctl = fwctl;
> > > + ret = fwctl->ops->open_uctx(uctx);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + scoped_guard(mutex, &fwctl->uctx_list_lock) {
> > > + list_add_tail(&uctx->uctx_list_entry, &fwctl->uctx_list);
> > > + }
> >
> > I guess more may come later but do we need {}?
>
> I guessed the extra {} would be style guide for this construct?
Maybe. Not seen any statements on that yet, and it's becoming
quite common.
>
> > > static int fwctl_fops_release(struct inode *inode, struct file *filp)
> > > {
> > > - struct fwctl_device *fwctl = filp->private_data;
> > > + struct fwctl_uctx *uctx = filp->private_data;
> > > + struct fwctl_device *fwctl = uctx->fwctl;
> > >
> > > + scoped_guard(rwsem_read, &fwctl->registration_lock) {
> > > + if (fwctl->ops) {
> >
> > Maybe a comment on when this path happens to help the reader
> > along. (when the file is closed and device is still alive).
> > Otherwise was cleaned up already in fwctl_unregister()
>
> scoped_guard(rwsem_read, &fwctl->registration_lock) {
> /*
> * fwctl_unregister() has already removed the driver and
> * destroyed the uctx.
> */
> if (fwctl->ops) {
>
Good.
> > > void fwctl_unregister(struct fwctl_device *fwctl)
> > > {
> > > + struct fwctl_uctx *uctx;
> > > +
> > > cdev_device_del(&fwctl->cdev, &fwctl->dev);
> > >
> > > + /* Disable and free the driver's resources for any still open FDs. */
> > > + guard(rwsem_write)(&fwctl->registration_lock);
> > > + guard(mutex)(&fwctl->uctx_list_lock);
> > > + while ((uctx = list_first_entry_or_null(&fwctl->uctx_list,
> > > + struct fwctl_uctx,
> > > + uctx_list_entry)))
> > > + fwctl_destroy_uctx(uctx);
> > > +
> >
> > Obviously it's a little more heavy weight but I'd just use
> > list_for_each_entry_safe()
> >
> > Less effort for reviewers than consider the custom iteration
> > you are doing instead.
>
> For these constructs the goal is the make the list empty, it is a
> tinsy bit safer/clearer to drive the list to empty purposefully rather
> than iterate over it and hope it is empty once done.
>
> However there is no possible way that list_for_each_entry_safe() would
> be an unsafe construct here. I can change it if you feel strongly
Meh. You get to maintain this if it flies, so your choice.
>
> > > @@ -26,6 +39,10 @@ struct fwctl_device {
> > > struct device dev;
> > > /* private: */
> > > struct cdev cdev;
> > > +
> > > + struct rw_semaphore registration_lock;
> > > + struct mutex uctx_list_lock;
> >
> > Even for private locks, a scope statement would
> > be good to have.
>
> Like so?
>
> /*
> * Protect ops, held for write when ops becomes NULL during unregister,
> * held for read whenver ops is loaded or an ops function is running.
That does the job nicely.
> */
> struct rw_semaphore registration_lock;
> /* Protect uctx_list */
> struct mutex uctx_list_lock;
>
> > > +/**
> > > + * DOC: General ioctl format
> > > + *
> > > + * The ioctl interface follows a general format to allow for extensibility. Each
> > > + * ioctl is passed in a structure pointer as the argument providing the size of
> > > + * the structure in the first u32. The kernel checks that any structure space
> > > + * beyond what it understands is 0. This allows userspace to use the backward
> > > + * compatible portion while consistently using the newer, larger, structures.
> >
> > Is that particularly helpful? Userspace needs to know not to put anything in
> > those fields, not hard for it to also know what the size it should send is?
> > The two will change together.
>
> It is very helpful for a practical userspace.
>
> Lets say we have an ioctl struct:
>
> struct fwctl_info {
> __u32 size;
> __u32 flags;
> __u32 out_device_type;
> __u32 device_data_len;
> __aligned_u64 out_device_data;
> };
>
> And the basic userspace pattern is:
>
> struct fwctl_info info = {.size = sizeof(info), ...);
> ioctl(fd, FWCTL_INFO, &info);
>
> This works today and generates the 24 byte command.
>
> Tomorrow the kernel adds a new member:
>
> struct fwctl_info {
> __u32 size;
> __u32 flags;
> __u32 out_device_type;
> __u32 device_data_len;
> __aligned_u64 out_device_data;
> __aligned_u64 new_thing;
> };
>
> Current builds of the userpace use a 24 byte command. A new kernel
> will see the 24 bytes and behave as before.
>
> When I recompile the userspace with the updated header it will issue a
> 32 byte command with no source change.
>
> Old kernel will see a 32 byte command with the trailing bytes it
> doesn't understand as 0 and keep working.
>
> The new kernel will see the new_thing bytes are zero and behave the
> same as before.
>
> If then the userspace decides to set new_thing the old kernel will
> stop working. Userspace can use some 'try and fail' approach to try
> again with new_thing = 0.
I'm not keen on try and fail interfaces because they become messy
if this has potentially be extended multiple times. Rest
of argument is fair enough. Thanks for the explanation.
>
> It gives a whole bunch of easy paths for userspace, otherwise
> userspace has to be very careful to match the size of the struct to
> the ABI it is targetting. Realistically nobody will do that right.
>
> Thanks,
> Jason
^ permalink raw reply [flat|nested] 48+ messages in thread* Re: [PATCH v2 2/8] fwctl: Basic ioctl dispatch for the character device
2024-07-30 17:28 ` Jonathan Cameron
@ 2024-08-01 13:05 ` Jason Gunthorpe
0 siblings, 0 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2024-08-01 13:05 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Tue, Jul 30, 2024 at 06:28:08PM +0100, Jonathan Cameron wrote:
> > And the basic userspace pattern is:
> >
> > struct fwctl_info info = {.size = sizeof(info), ...);
> > ioctl(fd, FWCTL_INFO, &info);
> >
> > This works today and generates the 24 byte command.
> >
> > Tomorrow the kernel adds a new member:
> >
> > struct fwctl_info {
> > __u32 size;
> > __u32 flags;
> > __u32 out_device_type;
> > __u32 device_data_len;
> > __aligned_u64 out_device_data;
> > __aligned_u64 new_thing;
> > };
> >
> > Current builds of the userpace use a 24 byte command. A new kernel
> > will see the 24 bytes and behave as before.
> >
> > When I recompile the userspace with the updated header it will issue a
> > 32 byte command with no source change.
> >
> > Old kernel will see a 32 byte command with the trailing bytes it
> > doesn't understand as 0 and keep working.
> >
> > The new kernel will see the new_thing bytes are zero and behave the
> > same as before.
> >
> > If then the userspace decides to set new_thing the old kernel will
> > stop working. Userspace can use some 'try and fail' approach to try
> > again with new_thing = 0.
>
> I'm not keen on try and fail interfaces because they become messy
> if this has potentially be extended multiple times. Rest
> of argument is fair enough. Thanks for the explanation.
I'd say try-and-fail is just the universal option, if there is merit
we can put cap bits and other things to positively indicate increased
kernel capability.
We have quite a deep experiance on this topic now in RDMA, and there
we've been doing both options, depending on the situation.
For instance you might introduce a new API that returns FOO and extend
a prior API to optionally accept FOO as well. A cap flag that the new
API exists is useful [1], but it is not for the prior API. The
userspace can just blindly pass FOO to the prior API, and if it
happened to get a non-zero FOO somehow then the kernel must also
support it..
[1] "try and fail" works well here too you can invoke the IOCTL with a
0 size and you get ENOTTY if the IOCTL is not understood, and another
error code if it is.
Jason
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 2/8] fwctl: Basic ioctl dispatch for the character device
2024-06-24 22:47 ` [PATCH v2 2/8] fwctl: Basic ioctl dispatch for the character device Jason Gunthorpe
2024-07-26 15:01 ` Jonathan Cameron
@ 2024-08-06 7:36 ` Daniel Vetter
2024-08-08 12:34 ` Jason Gunthorpe
1 sibling, 1 reply; 48+ messages in thread
From: Daniel Vetter @ 2024-08-06 7:36 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Mon, Jun 24, 2024 at 07:47:26PM -0300, Jason Gunthorpe wrote:
> diff --git a/include/linux/fwctl.h b/include/linux/fwctl.h
> index ef4eaa87c945e4..1d9651de92fc19 100644
> --- a/include/linux/fwctl.h
> +++ b/include/linux/fwctl.h
> @@ -11,7 +11,20 @@
> struct fwctl_device;
> struct fwctl_uctx;
>
> +/**
> + * struct fwctl_ops - Driver provided operations
> + * @uctx_size: The size of the fwctl_uctx struct to allocate. The first
> + * bytes of this memory will be a fwctl_uctx. The driver can use the
> + * remaining bytes as its private memory.
> + * @open_uctx: Called when a file descriptor is opened before the uctx is ever
> + * used.
> + * @close_uctx: Called when the uctx is destroyed, usually when the FD is
> + * closed.
> + */
> struct fwctl_ops {
> + size_t uctx_size;
> + int (*open_uctx)(struct fwctl_uctx *uctx);
> + void (*close_uctx)(struct fwctl_uctx *uctx);
Just a small bikeshed, I much prefer the inline kerneldoc style for ops
structs. It allows you to be appropriately verbose and document details
like error handling (more important for later additions) or that e.g. they
all must finish timely for otherwise fwctl_unregister hangs, without the
comment becoming an eyesore.
-Sima
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 48+ messages in thread* Re: [PATCH v2 2/8] fwctl: Basic ioctl dispatch for the character device
2024-08-06 7:36 ` Daniel Vetter
@ 2024-08-08 12:34 ` Jason Gunthorpe
0 siblings, 0 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2024-08-08 12:34 UTC (permalink / raw)
To: Daniel Vetter
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Tue, Aug 06, 2024 at 09:36:33AM +0200, Daniel Vetter wrote:
> On Mon, Jun 24, 2024 at 07:47:26PM -0300, Jason Gunthorpe wrote:
> > diff --git a/include/linux/fwctl.h b/include/linux/fwctl.h
> > index ef4eaa87c945e4..1d9651de92fc19 100644
> > --- a/include/linux/fwctl.h
> > +++ b/include/linux/fwctl.h
> > @@ -11,7 +11,20 @@
> > struct fwctl_device;
> > struct fwctl_uctx;
> >
> > +/**
> > + * struct fwctl_ops - Driver provided operations
> > + * @uctx_size: The size of the fwctl_uctx struct to allocate. The first
> > + * bytes of this memory will be a fwctl_uctx. The driver can use the
> > + * remaining bytes as its private memory.
> > + * @open_uctx: Called when a file descriptor is opened before the uctx is ever
> > + * used.
> > + * @close_uctx: Called when the uctx is destroyed, usually when the FD is
> > + * closed.
> > + */
> > struct fwctl_ops {
> > + size_t uctx_size;
> > + int (*open_uctx)(struct fwctl_uctx *uctx);
> > + void (*close_uctx)(struct fwctl_uctx *uctx);
>
> Just a small bikeshed, I much prefer the inline kerneldoc style for ops
> structs. It allows you to be appropriately verbose and document details
> like error handling (more important for later additions) or that e.g. they
> all must finish timely for otherwise fwctl_unregister hangs, without the
> comment becoming an eyesore.
Done
Jason
^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH v2 3/8] fwctl: FWCTL_INFO to return basic information about the device
2024-06-24 22:47 [PATCH v2 0/8] Introduce fwctl subystem Jason Gunthorpe
2024-06-24 22:47 ` [PATCH v2 1/8] fwctl: Add basic structure for a class subsystem with a cdev Jason Gunthorpe
2024-06-24 22:47 ` [PATCH v2 2/8] fwctl: Basic ioctl dispatch for the character device Jason Gunthorpe
@ 2024-06-24 22:47 ` Jason Gunthorpe
2024-07-26 15:15 ` Jonathan Cameron
2024-06-24 22:47 ` [PATCH v2 4/8] taint: Add TAINT_FWCTL Jason Gunthorpe
` (5 subsequent siblings)
8 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2024-06-24 22:47 UTC (permalink / raw)
To: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan
Cc: Andy Gospodarek, Aron Silverton, Dan Williams, David Ahern,
Christoph Hellwig, Jiri Pirko, Leonid Bloch, Leon Romanovsky,
linux-cxl, patches
Userspace will need to know some details about the fwctl interface being
used to locate the correct userspace code to communicate with the
kernel. Provide a simple device_type enum indicating what the kernel
driver is.
Allow the device to provide a device specific info struct that contains
any additional information that the driver may need to provide to
userspace.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/fwctl/main.c | 53 ++++++++++++++++++++++++++++++++++++++
include/linux/fwctl.h | 8 ++++++
include/uapi/fwctl/fwctl.h | 29 +++++++++++++++++++++
3 files changed, 90 insertions(+)
diff --git a/drivers/fwctl/main.c b/drivers/fwctl/main.c
index 6872c01d5c62e8..f1dec0b590aee4 100644
--- a/drivers/fwctl/main.c
+++ b/drivers/fwctl/main.c
@@ -17,6 +17,8 @@ enum {
static dev_t fwctl_dev;
static DEFINE_IDA(fwctl_ida);
+DEFINE_FREE(kfree_errptr, void *, if (!IS_ERR_OR_NULL(_T)) kfree(_T));
+
struct fwctl_ucmd {
struct fwctl_uctx *uctx;
void __user *ubuffer;
@@ -24,8 +26,58 @@ struct fwctl_ucmd {
u32 user_size;
};
+static int ucmd_respond(struct fwctl_ucmd *ucmd, size_t cmd_len)
+{
+ if (copy_to_user(ucmd->ubuffer, ucmd->cmd,
+ min_t(size_t, ucmd->user_size, cmd_len)))
+ return -EFAULT;
+ return 0;
+}
+
+static int copy_to_user_zero_pad(void __user *to, const void *from,
+ size_t from_len, size_t user_len)
+{
+ size_t copy_len;
+
+ copy_len = min(from_len, user_len);
+ if (copy_to_user(to, from, copy_len))
+ return -EFAULT;
+ if (copy_len < user_len) {
+ if (clear_user(to + copy_len, user_len - copy_len))
+ return -EFAULT;
+ }
+ return 0;
+}
+
+static int fwctl_cmd_info(struct fwctl_ucmd *ucmd)
+{
+ struct fwctl_device *fwctl = ucmd->uctx->fwctl;
+ struct fwctl_info *cmd = ucmd->cmd;
+ size_t driver_info_len = 0;
+
+ if (cmd->flags)
+ return -EOPNOTSUPP;
+
+ if (cmd->device_data_len) {
+ void *driver_info __free(kfree_errptr) =
+ fwctl->ops->info(ucmd->uctx, &driver_info_len);
+ if (IS_ERR(driver_info))
+ return PTR_ERR(driver_info);
+
+ if (copy_to_user_zero_pad(u64_to_user_ptr(cmd->out_device_data),
+ driver_info, driver_info_len,
+ cmd->device_data_len))
+ return -EFAULT;
+ }
+
+ cmd->out_device_type = fwctl->ops->device_type;
+ cmd->device_data_len = driver_info_len;
+ return ucmd_respond(ucmd, sizeof(*cmd));
+}
+
/* On stack memory for the ioctl structs */
union ucmd_buffer {
+ struct fwctl_info info;
};
struct fwctl_ioctl_op {
@@ -45,6 +97,7 @@ struct fwctl_ioctl_op {
.execute = _fn, \
}
static const struct fwctl_ioctl_op fwctl_ioctl_ops[] = {
+ IOCTL_OP(FWCTL_INFO, fwctl_cmd_info, struct fwctl_info, out_device_data),
};
static long fwctl_fops_ioctl(struct file *filp, unsigned int cmd,
diff --git a/include/linux/fwctl.h b/include/linux/fwctl.h
index 1d9651de92fc19..9a906b861acf3a 100644
--- a/include/linux/fwctl.h
+++ b/include/linux/fwctl.h
@@ -7,12 +7,14 @@
#include <linux/device.h>
#include <linux/cdev.h>
#include <linux/cleanup.h>
+#include <uapi/fwctl/fwctl.h>
struct fwctl_device;
struct fwctl_uctx;
/**
* struct fwctl_ops - Driver provided operations
+ * @device_type: The drivers assigned device_type number. This is uABI
* @uctx_size: The size of the fwctl_uctx struct to allocate. The first
* bytes of this memory will be a fwctl_uctx. The driver can use the
* remaining bytes as its private memory.
@@ -20,11 +22,17 @@ struct fwctl_uctx;
* used.
* @close_uctx: Called when the uctx is destroyed, usually when the FD is
* closed.
+ * @info: Implement FWCTL_INFO. Return a kmalloc() memory that is copied to
+ * out_device_data. On input length indicates the size of the user buffer
+ * on output it indicates the size of the memory. The driver can ignore
+ * length on input, the core code will handle everything.
*/
struct fwctl_ops {
+ enum fwctl_device_type device_type;
size_t uctx_size;
int (*open_uctx)(struct fwctl_uctx *uctx);
void (*close_uctx)(struct fwctl_uctx *uctx);
+ void *(*info)(struct fwctl_uctx *uctx, size_t *length);
};
/**
diff --git a/include/uapi/fwctl/fwctl.h b/include/uapi/fwctl/fwctl.h
index 0bdce95b6d69d9..39db9f09f8068e 100644
--- a/include/uapi/fwctl/fwctl.h
+++ b/include/uapi/fwctl/fwctl.h
@@ -36,6 +36,35 @@
*/
enum {
FWCTL_CMD_BASE = 0,
+ FWCTL_CMD_INFO = 0,
+ FWCTL_CMD_RPC = 1,
};
+enum fwctl_device_type {
+ FWCTL_DEVICE_TYPE_ERROR = 0,
+};
+
+/**
+ * struct fwctl_info - ioctl(FWCTL_INFO)
+ * @size: sizeof(struct fwctl_info)
+ * @flags: Must be 0
+ * @out_device_type: Returns the type of the device from enum fwctl_device_type
+ * @device_data_len: On input the length of the out_device_data memory. On
+ * output the size of the kernel's device_data which may be larger or
+ * smaller than the input. Maybe 0 on input.
+ * @out_device_data: Pointer to a memory of device_data_len bytes. Kernel will
+ * fill the entire memory, zeroing as required.
+ *
+ * Returns basic information about this fwctl instance, particularly what driver
+ * is being used to define the device_data format.
+ */
+struct fwctl_info {
+ __u32 size;
+ __u32 flags;
+ __u32 out_device_type;
+ __u32 device_data_len;
+ __aligned_u64 out_device_data;
+};
+#define FWCTL_INFO _IO(FWCTL_TYPE, FWCTL_CMD_INFO)
+
#endif
--
2.45.2
^ permalink raw reply related [flat|nested] 48+ messages in thread* Re: [PATCH v2 3/8] fwctl: FWCTL_INFO to return basic information about the device
2024-06-24 22:47 ` [PATCH v2 3/8] fwctl: FWCTL_INFO to return basic information about the device Jason Gunthorpe
@ 2024-07-26 15:15 ` Jonathan Cameron
2024-07-29 16:35 ` Jason Gunthorpe
0 siblings, 1 reply; 48+ messages in thread
From: Jonathan Cameron @ 2024-07-26 15:15 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Mon, 24 Jun 2024 19:47:27 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:
> Userspace will need to know some details about the fwctl interface being
> used to locate the correct userspace code to communicate with the
> kernel. Provide a simple device_type enum indicating what the kernel
> driver is.
As below - maybe consider a UUID?
Would let you decouple allocating those with upstreaming drivers.
We'll just get annoying races on the enum otherwise as multiple
drivers get upstreamed that use this.
>
> Allow the device to provide a device specific info struct that contains
> any additional information that the driver may need to provide to
> userspace.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> drivers/fwctl/main.c | 53 ++++++++++++++++++++++++++++++++++++++
> include/linux/fwctl.h | 8 ++++++
> include/uapi/fwctl/fwctl.h | 29 +++++++++++++++++++++
> 3 files changed, 90 insertions(+)
>
> diff --git a/drivers/fwctl/main.c b/drivers/fwctl/main.c
> index 6872c01d5c62e8..f1dec0b590aee4 100644
> --- a/drivers/fwctl/main.c
> +++ b/drivers/fwctl/main.c
> @@ -17,6 +17,8 @@ enum {
> static dev_t fwctl_dev;
> static DEFINE_IDA(fwctl_ida);
>
> +DEFINE_FREE(kfree_errptr, void *, if (!IS_ERR_OR_NULL(_T)) kfree(_T));
Why need for a new one? That's the same as the one in slab.h from
6.9 onwards. Before that it was
if (_T)
I was going to suggest promoting this to slab.h and then found
the normal implementation had been improved since I last checked.
>
> /**
> diff --git a/include/uapi/fwctl/fwctl.h b/include/uapi/fwctl/fwctl.h
> index 0bdce95b6d69d9..39db9f09f8068e 100644
> --- a/include/uapi/fwctl/fwctl.h
> +++ b/include/uapi/fwctl/fwctl.h
> @@ -36,6 +36,35 @@
> */
> enum {
> FWCTL_CMD_BASE = 0,
> + FWCTL_CMD_INFO = 0,
> + FWCTL_CMD_RPC = 1,
> };
>
> +enum fwctl_device_type {
> + FWCTL_DEVICE_TYPE_ERROR = 0,
> +};
> +
> +/**
> + * struct fwctl_info - ioctl(FWCTL_INFO)
> + * @size: sizeof(struct fwctl_info)
> + * @flags: Must be 0
> + * @out_device_type: Returns the type of the device from enum fwctl_device_type
Maybe a UUID? Avoid need to synchronize that list for ever.
> + * @device_data_len: On input the length of the out_device_data memory. On
> + * output the size of the kernel's device_data which may be larger or
> + * smaller than the input. Maybe 0 on input.
> + * @out_device_data: Pointer to a memory of device_data_len bytes. Kernel will
> + * fill the entire memory, zeroing as required.
Why do we need device in names of these two?
> + *
> + * Returns basic information about this fwctl instance, particularly what driver
> + * is being used to define the device_data format.
> + */
> +struct fwctl_info {
> + __u32 size;
> + __u32 flags;
> + __u32 out_device_type;
> + __u32 device_data_len;
> + __aligned_u64 out_device_data;
> +};
> +#define FWCTL_INFO _IO(FWCTL_TYPE, FWCTL_CMD_INFO)
> +
> #endif
^ permalink raw reply [flat|nested] 48+ messages in thread* Re: [PATCH v2 3/8] fwctl: FWCTL_INFO to return basic information about the device
2024-07-26 15:15 ` Jonathan Cameron
@ 2024-07-29 16:35 ` Jason Gunthorpe
2024-07-30 17:34 ` Jonathan Cameron
0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2024-07-29 16:35 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Fri, Jul 26, 2024 at 04:15:03PM +0100, Jonathan Cameron wrote:
> On Mon, 24 Jun 2024 19:47:27 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> > Userspace will need to know some details about the fwctl interface being
> > used to locate the correct userspace code to communicate with the
> > kernel. Provide a simple device_type enum indicating what the kernel
> > driver is.
>
> As below - maybe consider a UUID?
> Would let you decouple allocating those with upstreaming drivers.
> We'll just get annoying races on the enum otherwise as multiple
> drivers get upstreamed that use this.
I view the coupling as a feature - controlling uABI number assignment
is one of the subtle motivations the kernel community has typically
used to encourage upstream participation.
> > static dev_t fwctl_dev;
> > static DEFINE_IDA(fwctl_ida);
> >
> > +DEFINE_FREE(kfree_errptr, void *, if (!IS_ERR_OR_NULL(_T)) kfree(_T));
>
> Why need for a new one? That's the same as the one in slab.h from
> 6.9 onwards. Before that it was
> if (_T)
>
> I was going to suggest promoting this to slab.h and then found
> the normal implementation had been improved since I last checked.
Same, now it is improved it can go away.
> > +/**
> > + * struct fwctl_info - ioctl(FWCTL_INFO)
> > + * @size: sizeof(struct fwctl_info)
> > + * @flags: Must be 0
> > + * @out_device_type: Returns the type of the device from enum fwctl_device_type
>
> Maybe a UUID? Avoid need to synchronize that list for ever.
>
> > + * @device_data_len: On input the length of the out_device_data memory. On
> > + * output the size of the kernel's device_data which may be larger or
> > + * smaller than the input. Maybe 0 on input.
> > + * @out_device_data: Pointer to a memory of device_data_len bytes. Kernel will
> > + * fill the entire memory, zeroing as required.
>
> Why do we need device in names of these two?
I'm not sure I understand this question?
out_device_type returns the "name"
out_device_data returns a struct of data, the layout of the struct is
defined by out_device_type
Jason
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 3/8] fwctl: FWCTL_INFO to return basic information about the device
2024-07-29 16:35 ` Jason Gunthorpe
@ 2024-07-30 17:34 ` Jonathan Cameron
2024-08-01 13:11 ` Jason Gunthorpe
0 siblings, 1 reply; 48+ messages in thread
From: Jonathan Cameron @ 2024-07-30 17:34 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Mon, 29 Jul 2024 13:35:13 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:
> On Fri, Jul 26, 2024 at 04:15:03PM +0100, Jonathan Cameron wrote:
> > On Mon, 24 Jun 2024 19:47:27 -0300
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > > Userspace will need to know some details about the fwctl interface being
> > > used to locate the correct userspace code to communicate with the
> > > kernel. Provide a simple device_type enum indicating what the kernel
> > > driver is.
> >
> > As below - maybe consider a UUID?
> > Would let you decouple allocating those with upstreaming drivers.
> > We'll just get annoying races on the enum otherwise as multiple
> > drivers get upstreamed that use this.
>
> I view the coupling as a feature - controlling uABI number assignment
> is one of the subtle motivations the kernel community has typically
> used to encourage upstream participation.
Hmm. I'm not sure it's worth the possible pain if this becomes
popular. Maybe you'll have to run a reservation hotline.
>
> > > +/**
> > > + * struct fwctl_info - ioctl(FWCTL_INFO)
> > > + * @size: sizeof(struct fwctl_info)
> > > + * @flags: Must be 0
> > > + * @out_device_type: Returns the type of the device from enum fwctl_device_type
> >
> > Maybe a UUID? Avoid need to synchronize that list for ever.
> >
> > > + * @device_data_len: On input the length of the out_device_data memory. On
> > > + * output the size of the kernel's device_data which may be larger or
> > > + * smaller than the input. Maybe 0 on input.
> > > + * @out_device_data: Pointer to a memory of device_data_len bytes. Kernel will
> > > + * fill the entire memory, zeroing as required.
> >
> > Why do we need device in names of these two?
>
> I'm not sure I understand this question?
>
> out_device_type returns the "name"
>
> out_device_data returns a struct of data, the layout of the struct is
> defined by out_device_type
What is device in this case? fwctl struct device, hardware device, something else?
I'm not sure what the names give over
fwctl_type, out_data_len, out_data
The first one can't just be type as likely as not out_data contains a
type field specific to the fwctl_device_type.
I don't feel that strongly about this though, so stick to device
if you like. I'll get used to it.
Jonathan
>
> Jason
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 3/8] fwctl: FWCTL_INFO to return basic information about the device
2024-07-30 17:34 ` Jonathan Cameron
@ 2024-08-01 13:11 ` Jason Gunthorpe
0 siblings, 0 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2024-08-01 13:11 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Tue, Jul 30, 2024 at 06:34:41PM +0100, Jonathan Cameron wrote:
> On Mon, 29 Jul 2024 13:35:13 -0300
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> > On Fri, Jul 26, 2024 at 04:15:03PM +0100, Jonathan Cameron wrote:
> > > On Mon, 24 Jun 2024 19:47:27 -0300
> > > Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > > Userspace will need to know some details about the fwctl interface being
> > > > used to locate the correct userspace code to communicate with the
> > > > kernel. Provide a simple device_type enum indicating what the kernel
> > > > driver is.
> > >
> > > As below - maybe consider a UUID?
> > > Would let you decouple allocating those with upstreaming drivers.
> > > We'll just get annoying races on the enum otherwise as multiple
> > > drivers get upstreamed that use this.
> >
> > I view the coupling as a feature - controlling uABI number assignment
> > is one of the subtle motivations the kernel community has typically
> > used to encourage upstream participation.
>
> Hmm. I'm not sure it's worth the possible pain if this becomes
> popular. Maybe you'll have to run a reservation hotline.
As long as there is one tree things get sorted. We'd need a big scale
of new drivers for this to be a practical problem. Big incentive for
people to get their stuff merged before shipping it. :)
> > > > + * @device_data_len: On input the length of the out_device_data memory. On
> > > > + * output the size of the kernel's device_data which may be larger or
> > > > + * smaller than the input. Maybe 0 on input.
> > > > + * @out_device_data: Pointer to a memory of device_data_len bytes. Kernel will
> > > > + * fill the entire memory, zeroing as required.
> > >
> > > Why do we need device in names of these two?
> >
> > I'm not sure I understand this question?
> >
> > out_device_type returns the "name"
> >
> > out_device_data returns a struct of data, the layout of the struct is
> > defined by out_device_type
>
> What is device in this case? fwctl struct device, hardware device, something else?
Oh, I see what you are asking..
fwctl is split into a common ABI and a "device" ABI which varies
depending on the fwctl driver. So The labeling was to put "device" in
front of those things that vary.
Basically if you touch a "device" field you need a userspace driver
that understands that device.
Not sure it is worth to have an explicit naming, it is sort of a
RDMAism creeping in where we called this concept "driver_data"
Jason
^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH v2 4/8] taint: Add TAINT_FWCTL
2024-06-24 22:47 [PATCH v2 0/8] Introduce fwctl subystem Jason Gunthorpe
` (2 preceding siblings ...)
2024-06-24 22:47 ` [PATCH v2 3/8] fwctl: FWCTL_INFO to return basic information about the device Jason Gunthorpe
@ 2024-06-24 22:47 ` Jason Gunthorpe
2024-06-25 19:03 ` Randy Dunlap
2024-06-24 22:47 ` [PATCH v2 5/8] fwctl: FWCTL_RPC to execute a Remote Procedure Call to device firmware Jason Gunthorpe
` (4 subsequent siblings)
8 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2024-06-24 22:47 UTC (permalink / raw)
To: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan
Cc: Andy Gospodarek, Aron Silverton, Dan Williams, David Ahern,
Christoph Hellwig, Jiri Pirko, Leonid Bloch, Leon Romanovsky,
linux-cxl, patches
Requesting a fwctl scope of access that includes mutating device debug
data will cause the kernel to be tainted. Changing the device operation
through things in the debug scope may cause the device to malfunction in
undefined ways. This should be reflected in the TAINT flags to help any
debuggers understand that something has been done.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
Documentation/admin-guide/tainted-kernels.rst | 5 +++++
include/linux/panic.h | 3 ++-
kernel/panic.c | 1 +
3 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
index f92551539e8a66..f91a54966a9719 100644
--- a/Documentation/admin-guide/tainted-kernels.rst
+++ b/Documentation/admin-guide/tainted-kernels.rst
@@ -101,6 +101,7 @@ Bit Log Number Reason that got the kernel tainted
16 _/X 65536 auxiliary taint, defined for and used by distros
17 _/T 131072 kernel was built with the struct randomization plugin
18 _/N 262144 an in-kernel test has been run
+ 19 _/J 524288 userspace used a mutating debug operation in fwctl
=== === ====== ========================================================
Note: The character ``_`` is representing a blank in this table to make reading
@@ -182,3 +183,7 @@ More detailed explanation for tainting
produce extremely unusual kernel structure layouts (even performance
pathological ones), which is important to know when debugging. Set at
build time.
+
+ 18) ``J`` if userpace opened /dev/fwctl/* and performed a FWTCL_RPC_DEBUG_WRITE
+ to use the devices debugging features. Device debugging features could
+ cause the device to malfunction in undefined ways.
diff --git a/include/linux/panic.h b/include/linux/panic.h
index 6717b15e798c38..5dfd5295effd40 100644
--- a/include/linux/panic.h
+++ b/include/linux/panic.h
@@ -73,7 +73,8 @@ static inline void set_arch_panic_timeout(int timeout, int arch_default_timeout)
#define TAINT_AUX 16
#define TAINT_RANDSTRUCT 17
#define TAINT_TEST 18
-#define TAINT_FLAGS_COUNT 19
+#define TAINT_FWCTL 19
+#define TAINT_FLAGS_COUNT 20
#define TAINT_FLAGS_MAX ((1UL << TAINT_FLAGS_COUNT) - 1)
struct taint_flag {
diff --git a/kernel/panic.c b/kernel/panic.c
index 8bff183d6180e7..b71f573ec7c5fc 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -494,6 +494,7 @@ const struct taint_flag taint_flags[TAINT_FLAGS_COUNT] = {
[ TAINT_AUX ] = { 'X', ' ', true },
[ TAINT_RANDSTRUCT ] = { 'T', ' ', true },
[ TAINT_TEST ] = { 'N', ' ', true },
+ [ TAINT_FWCTL ] = { 'J', ' ', true },
};
/**
--
2.45.2
^ permalink raw reply related [flat|nested] 48+ messages in thread* Re: [PATCH v2 4/8] taint: Add TAINT_FWCTL
2024-06-24 22:47 ` [PATCH v2 4/8] taint: Add TAINT_FWCTL Jason Gunthorpe
@ 2024-06-25 19:03 ` Randy Dunlap
2024-07-10 16:04 ` Jason Gunthorpe
0 siblings, 1 reply; 48+ messages in thread
From: Randy Dunlap @ 2024-06-25 19:03 UTC (permalink / raw)
To: Jason Gunthorpe, Jonathan Corbet, Itay Avraham, Jakub Kicinski,
Leon Romanovsky, linux-doc, linux-rdma, netdev, Paolo Abeni,
Saeed Mahameed, Tariq Toukan
Cc: Andy Gospodarek, Aron Silverton, Dan Williams, David Ahern,
Christoph Hellwig, Jiri Pirko, Leonid Bloch, Leon Romanovsky,
linux-cxl, patches
On 6/24/24 3:47 PM, Jason Gunthorpe wrote:
> Requesting a fwctl scope of access that includes mutating device debug
> data will cause the kernel to be tainted. Changing the device operation
> through things in the debug scope may cause the device to malfunction in
> undefined ways. This should be reflected in the TAINT flags to help any
> debuggers understand that something has been done.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Please also update tools/debugging/kernel-chktaint.
> ---
> Documentation/admin-guide/tainted-kernels.rst | 5 +++++
> include/linux/panic.h | 3 ++-
> kernel/panic.c | 1 +
> 3 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
> index f92551539e8a66..f91a54966a9719 100644
> --- a/Documentation/admin-guide/tainted-kernels.rst
> +++ b/Documentation/admin-guide/tainted-kernels.rst
> @@ -101,6 +101,7 @@ Bit Log Number Reason that got the kernel tainted
> 16 _/X 65536 auxiliary taint, defined for and used by distros
> 17 _/T 131072 kernel was built with the struct randomization plugin
> 18 _/N 262144 an in-kernel test has been run
> + 19 _/J 524288 userspace used a mutating debug operation in fwctl
> === === ====== ========================================================
>
> Note: The character ``_`` is representing a blank in this table to make reading
> @@ -182,3 +183,7 @@ More detailed explanation for tainting
> produce extremely unusual kernel structure layouts (even performance
> pathological ones), which is important to know when debugging. Set at
> build time.
> +
> + 18) ``J`` if userpace opened /dev/fwctl/* and performed a FWTCL_RPC_DEBUG_WRITE
> + to use the devices debugging features. Device debugging features could
> + cause the device to malfunction in undefined ways.
> diff --git a/include/linux/panic.h b/include/linux/panic.h
> index 6717b15e798c38..5dfd5295effd40 100644
> --- a/include/linux/panic.h
> +++ b/include/linux/panic.h
> @@ -73,7 +73,8 @@ static inline void set_arch_panic_timeout(int timeout, int arch_default_timeout)
> #define TAINT_AUX 16
> #define TAINT_RANDSTRUCT 17
> #define TAINT_TEST 18
> -#define TAINT_FLAGS_COUNT 19
> +#define TAINT_FWCTL 19
> +#define TAINT_FLAGS_COUNT 20
> #define TAINT_FLAGS_MAX ((1UL << TAINT_FLAGS_COUNT) - 1)
>
> struct taint_flag {
> diff --git a/kernel/panic.c b/kernel/panic.c
> index 8bff183d6180e7..b71f573ec7c5fc 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -494,6 +494,7 @@ const struct taint_flag taint_flags[TAINT_FLAGS_COUNT] = {
> [ TAINT_AUX ] = { 'X', ' ', true },
> [ TAINT_RANDSTRUCT ] = { 'T', ' ', true },
> [ TAINT_TEST ] = { 'N', ' ', true },
> + [ TAINT_FWCTL ] = { 'J', ' ', true },
> };
>
> /**
--
~Randy
^ permalink raw reply [flat|nested] 48+ messages in thread* Re: [PATCH v2 4/8] taint: Add TAINT_FWCTL
2024-06-25 19:03 ` Randy Dunlap
@ 2024-07-10 16:04 ` Jason Gunthorpe
0 siblings, 0 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2024-07-10 16:04 UTC (permalink / raw)
To: Randy Dunlap
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Tue, Jun 25, 2024 at 12:03:35PM -0700, Randy Dunlap wrote:
>
>
> On 6/24/24 3:47 PM, Jason Gunthorpe wrote:
> > Requesting a fwctl scope of access that includes mutating device debug
> > data will cause the kernel to be tainted. Changing the device operation
> > through things in the debug scope may cause the device to malfunction in
> > undefined ways. This should be reflected in the TAINT flags to help any
> > debuggers understand that something has been done.
> >
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>
> Please also update tools/debugging/kernel-chktaint.
Got it, thanks
Jason
^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH v2 5/8] fwctl: FWCTL_RPC to execute a Remote Procedure Call to device firmware
2024-06-24 22:47 [PATCH v2 0/8] Introduce fwctl subystem Jason Gunthorpe
` (3 preceding siblings ...)
2024-06-24 22:47 ` [PATCH v2 4/8] taint: Add TAINT_FWCTL Jason Gunthorpe
@ 2024-06-24 22:47 ` Jason Gunthorpe
2024-07-26 15:30 ` Jonathan Cameron
` (2 more replies)
2024-06-24 22:47 ` [PATCH v2 6/8] fwctl: Add documentation Jason Gunthorpe
` (3 subsequent siblings)
8 siblings, 3 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2024-06-24 22:47 UTC (permalink / raw)
To: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan
Cc: Andy Gospodarek, Aron Silverton, Dan Williams, David Ahern,
Christoph Hellwig, Jiri Pirko, Leonid Bloch, Leon Romanovsky,
linux-cxl, patches
Add the FWCTL_RPC ioctl which allows a request/response RPC call to device
firmware. Drivers implementing this call must follow the security
guidelines under Documentation/userspace-api/fwctl.rst
The core code provides some memory management helpers to get the messages
copied from and back to userspace. The driver is responsible for
allocating the output message memory and delivering the message to the
device.
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/fwctl/main.c | 62 +++++++++++++++++++++++++++++++++++
include/linux/fwctl.h | 5 +++
include/uapi/fwctl/fwctl.h | 66 ++++++++++++++++++++++++++++++++++++++
3 files changed, 133 insertions(+)
diff --git a/drivers/fwctl/main.c b/drivers/fwctl/main.c
index f1dec0b590aee4..9506b993a1a56d 100644
--- a/drivers/fwctl/main.c
+++ b/drivers/fwctl/main.c
@@ -8,16 +8,20 @@
#include <linux/slab.h>
#include <linux/container_of.h>
#include <linux/fs.h>
+#include <linux/sizes.h>
#include <uapi/fwctl/fwctl.h>
enum {
FWCTL_MAX_DEVICES = 256,
+ MAX_RPC_LEN = SZ_2M,
};
static dev_t fwctl_dev;
static DEFINE_IDA(fwctl_ida);
+static unsigned long fwctl_tainted;
DEFINE_FREE(kfree_errptr, void *, if (!IS_ERR_OR_NULL(_T)) kfree(_T));
+DEFINE_FREE(kvfree_errptr, void *, if (!IS_ERR_OR_NULL(_T)) kvfree(_T));
struct fwctl_ucmd {
struct fwctl_uctx *uctx;
@@ -75,9 +79,66 @@ static int fwctl_cmd_info(struct fwctl_ucmd *ucmd)
return ucmd_respond(ucmd, sizeof(*cmd));
}
+static int fwctl_cmd_rpc(struct fwctl_ucmd *ucmd)
+{
+ struct fwctl_device *fwctl = ucmd->uctx->fwctl;
+ struct fwctl_rpc *cmd = ucmd->cmd;
+ size_t out_len;
+
+ if (cmd->in_len > MAX_RPC_LEN || cmd->out_len > MAX_RPC_LEN)
+ return -EMSGSIZE;
+
+ switch (cmd->scope) {
+ case FWCTL_RPC_CONFIGURATION:
+ case FWCTL_RPC_DEBUG_READ_ONLY:
+ break;
+
+ case FWCTL_RPC_DEBUG_WRITE_FULL:
+ if (!capable(CAP_SYS_RAWIO))
+ return -EPERM;
+ fallthrough;
+ case FWCTL_RPC_DEBUG_WRITE:
+ if (!test_and_set_bit(0, &fwctl_tainted)) {
+ dev_warn(
+ &fwctl->dev,
+ "%s(%d): has requested full access to the physical device device",
+ current->comm, task_pid_nr(current));
+ add_taint(TAINT_FWCTL, LOCKDEP_STILL_OK);
+ }
+ break;
+ default:
+ return -EOPNOTSUPP;
+ };
+
+ void *inbuf __free(kvfree) =
+ kvzalloc(cmd->in_len, GFP_KERNEL | GFP_KERNEL_ACCOUNT);
+ if (!inbuf)
+ return -ENOMEM;
+ if (copy_from_user(inbuf, u64_to_user_ptr(cmd->in), cmd->in_len))
+ return -EFAULT;
+
+ out_len = cmd->out_len;
+ void *outbuf __free(kvfree_errptr) = fwctl->ops->fw_rpc(
+ ucmd->uctx, cmd->scope, inbuf, cmd->in_len, &out_len);
+ if (IS_ERR(outbuf))
+ return PTR_ERR(outbuf);
+ if (outbuf == inbuf) {
+ /* The driver can re-use inbuf as outbuf */
+ inbuf = NULL;
+ }
+
+ if (copy_to_user(u64_to_user_ptr(cmd->out), outbuf,
+ min(cmd->out_len, out_len)))
+ return -EFAULT;
+
+ cmd->out_len = out_len;
+ return ucmd_respond(ucmd, sizeof(*cmd));
+}
+
/* On stack memory for the ioctl structs */
union ucmd_buffer {
struct fwctl_info info;
+ struct fwctl_rpc rpc;
};
struct fwctl_ioctl_op {
@@ -98,6 +159,7 @@ struct fwctl_ioctl_op {
}
static const struct fwctl_ioctl_op fwctl_ioctl_ops[] = {
IOCTL_OP(FWCTL_INFO, fwctl_cmd_info, struct fwctl_info, out_device_data),
+ IOCTL_OP(FWCTL_RPC, fwctl_cmd_rpc, struct fwctl_rpc, out),
};
static long fwctl_fops_ioctl(struct file *filp, unsigned int cmd,
diff --git a/include/linux/fwctl.h b/include/linux/fwctl.h
index 9a906b861acf3a..294cfbf63306a2 100644
--- a/include/linux/fwctl.h
+++ b/include/linux/fwctl.h
@@ -26,6 +26,9 @@ struct fwctl_uctx;
* out_device_data. On input length indicates the size of the user buffer
* on output it indicates the size of the memory. The driver can ignore
* length on input, the core code will handle everything.
+ * @fw_rpc: Implement FWCTL_RPC. Deliver rpc_in/in_len to the FW and return
+ * the response and set out_len. rpc_in can be returned as the response
+ * pointer. Otherwise the returned pointer is freed with kvfree().
*/
struct fwctl_ops {
enum fwctl_device_type device_type;
@@ -33,6 +36,8 @@ struct fwctl_ops {
int (*open_uctx)(struct fwctl_uctx *uctx);
void (*close_uctx)(struct fwctl_uctx *uctx);
void *(*info)(struct fwctl_uctx *uctx, size_t *length);
+ void *(*fw_rpc)(struct fwctl_uctx *uctx, enum fwctl_rpc_scope scope,
+ void *rpc_in, size_t in_len, size_t *out_len);
};
/**
diff --git a/include/uapi/fwctl/fwctl.h b/include/uapi/fwctl/fwctl.h
index 39db9f09f8068e..8bde0d4416fd55 100644
--- a/include/uapi/fwctl/fwctl.h
+++ b/include/uapi/fwctl/fwctl.h
@@ -67,4 +67,70 @@ struct fwctl_info {
};
#define FWCTL_INFO _IO(FWCTL_TYPE, FWCTL_CMD_INFO)
+/**
+ * enum fwctl_rpc_scope - Scope of access for the RPC
+ */
+enum fwctl_rpc_scope {
+ /**
+ * @FWCTL_RPC_CONFIGURATION: Device configuration access scope
+ *
+ * Read/write access to device configuration. When configuration
+ * is written to the device remains in a fully supported state.
+ */
+ FWCTL_RPC_CONFIGURATION = 0,
+ /**
+ * @FWCTL_RPC_DEBUG_READ_ONLY: Read only access to debug information
+ *
+ * Readable debug information. Debug information is compatible with
+ * kernel lockdown, and does not disclose any sensitive information. For
+ * instance exposing any encryption secrets from this information is
+ * forbidden.
+ */
+ FWCTL_RPC_DEBUG_READ_ONLY = 1,
+ /**
+ * @FWCTL_RPC_DEBUG_WRITE: Writable access to lockdown compatible debug information
+ *
+ * Allows write access to data in the device which may leave a fully
+ * supported state. This is intended to permit intensive and possibly
+ * invasive debugging. This scope will taint the kernel.
+ */
+ FWCTL_RPC_DEBUG_WRITE = 2,
+ /**
+ * @FWCTL_RPC_DEBUG_WRITE_FULL: Writable access to all debug information
+ *
+ * Allows read/write access to everything. Requires CAP_SYS_RAW_IO, so
+ * it is not required to follow lockdown principals. If in doubt
+ * debugging should be placed in this scope. This scope will taint the
+ * kernel.
+ */
+ FWCTL_RPC_DEBUG_WRITE_FULL = 3,
+};
+
+/**
+ * struct fwctl_rpc - ioctl(FWCTL_RPC)
+ * @size: sizeof(struct fwctl_rpc)
+ * @scope: One of enum fwctl_rpc_scope, required scope for the RPC
+ * @in_len: Length of the in memory
+ * @out_len: Length of the out memory
+ * @in: Request message in device specific format
+ * @out: Response message in device specific format
+ *
+ * Deliver a Remote Procedure Call to the device FW and return the response. The
+ * call's parameters and return are marshaled into linear buffers of memory. Any
+ * errno indicates that delivery of the RPC to the device failed. Return status
+ * originating in the device during a successful delivery must be encoded into
+ * out.
+ *
+ * The format of the buffers matches the out_device_type from FWCTL_INFO.
+ */
+struct fwctl_rpc {
+ __u32 size;
+ __u32 scope;
+ __u32 in_len;
+ __u32 out_len;
+ __aligned_u64 in;
+ __aligned_u64 out;
+};
+#define FWCTL_RPC _IO(FWCTL_TYPE, FWCTL_CMD_RPC)
+
#endif
--
2.45.2
^ permalink raw reply related [flat|nested] 48+ messages in thread* Re: [PATCH v2 5/8] fwctl: FWCTL_RPC to execute a Remote Procedure Call to device firmware
2024-06-24 22:47 ` [PATCH v2 5/8] fwctl: FWCTL_RPC to execute a Remote Procedure Call to device firmware Jason Gunthorpe
@ 2024-07-26 15:30 ` Jonathan Cameron
2024-07-29 16:28 ` Jason Gunthorpe
2024-07-30 8:00 ` Leon Romanovsky
2024-08-07 7:44 ` Oded Gabbay
2 siblings, 1 reply; 48+ messages in thread
From: Jonathan Cameron @ 2024-07-26 15:30 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Mon, 24 Jun 2024 19:47:29 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:
> Add the FWCTL_RPC ioctl which allows a request/response RPC call to device
> firmware. Drivers implementing this call must follow the security
> guidelines under Documentation/userspace-api/fwctl.rst
>
> The core code provides some memory management helpers to get the messages
> copied from and back to userspace. The driver is responsible for
> allocating the output message memory and delivering the message to the
> device.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
A few minor things inline.
> ---
> drivers/fwctl/main.c | 62 +++++++++++++++++++++++++++++++++++
> include/linux/fwctl.h | 5 +++
> include/uapi/fwctl/fwctl.h | 66 ++++++++++++++++++++++++++++++++++++++
> 3 files changed, 133 insertions(+)
>
> diff --git a/drivers/fwctl/main.c b/drivers/fwctl/main.c
> index f1dec0b590aee4..9506b993a1a56d 100644
> --- a/drivers/fwctl/main.c
> +++ b/drivers/fwctl/main.c
> @@ -8,16 +8,20 @@
> #include <linux/slab.h>
> #include <linux/container_of.h>
> #include <linux/fs.h>
> +#include <linux/sizes.h>
>
> #include <uapi/fwctl/fwctl.h>
>
> enum {
> FWCTL_MAX_DEVICES = 256,
> + MAX_RPC_LEN = SZ_2M,
> };
In what way is that usefully handled as an enum?
I'd just use #defines
> static dev_t fwctl_dev;
> static DEFINE_IDA(fwctl_ida);
> +static unsigned long fwctl_tainted;
>
> DEFINE_FREE(kfree_errptr, void *, if (!IS_ERR_OR_NULL(_T)) kfree(_T));
> +DEFINE_FREE(kvfree_errptr, void *, if (!IS_ERR_OR_NULL(_T)) kvfree(_T));
kvfree define free already defined as this since 6.9
>
> struct fwctl_ucmd {
> struct fwctl_uctx *uctx;
> @@ -75,9 +79,66 @@ static int fwctl_cmd_info(struct fwctl_ucmd *ucmd)
> return ucmd_respond(ucmd, sizeof(*cmd));
> }
>
> +static int fwctl_cmd_rpc(struct fwctl_ucmd *ucmd)
> +{
> + struct fwctl_device *fwctl = ucmd->uctx->fwctl;
> + struct fwctl_rpc *cmd = ucmd->cmd;
> + size_t out_len;
> +
> + if (cmd->in_len > MAX_RPC_LEN || cmd->out_len > MAX_RPC_LEN)
> + return -EMSGSIZE;
> +
> + switch (cmd->scope) {
> + case FWCTL_RPC_CONFIGURATION:
> + case FWCTL_RPC_DEBUG_READ_ONLY:
> + break;
> +
> + case FWCTL_RPC_DEBUG_WRITE_FULL:
> + if (!capable(CAP_SYS_RAWIO))
> + return -EPERM;
> + fallthrough;
> + case FWCTL_RPC_DEBUG_WRITE:
> + if (!test_and_set_bit(0, &fwctl_tainted)) {
> + dev_warn(
> + &fwctl->dev,
> + "%s(%d): has requested full access to the physical device device",
> + current->comm, task_pid_nr(current));
> + add_taint(TAINT_FWCTL, LOCKDEP_STILL_OK);
> + }
> + break;
> + default:
> + return -EOPNOTSUPP;
> + };
> +
> + void *inbuf __free(kvfree) =
> + kvzalloc(cmd->in_len, GFP_KERNEL | GFP_KERNEL_ACCOUNT);
As before
#define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)
so don't need both.
> + if (!inbuf)
> + return -ENOMEM;
> + if (copy_from_user(inbuf, u64_to_user_ptr(cmd->in), cmd->in_len))
> + return -EFAULT;
> +
> + out_len = cmd->out_len;
> + void *outbuf __free(kvfree_errptr) = fwctl->ops->fw_rpc(
> + ucmd->uctx, cmd->scope, inbuf, cmd->in_len, &out_len);
> + if (IS_ERR(outbuf))
> + return PTR_ERR(outbuf);
> + if (outbuf == inbuf) {
> + /* The driver can re-use inbuf as outbuf */
> + inbuf = NULL;
I wish no_free_ptr() didn't have __must_check. Can do something ugly
like
outbuf = no_free_ptr(inbuf);
probably but maybe just setting it NULL is simpler.
> + }
> +
> + if (copy_to_user(u64_to_user_ptr(cmd->out), outbuf,
> + min(cmd->out_len, out_len)))
> + return -EFAULT;
> +
> + cmd->out_len = out_len;
> + return ucmd_respond(ucmd, sizeof(*cmd));
> +}
> diff --git a/include/uapi/fwctl/fwctl.h b/include/uapi/fwctl/fwctl.h
> index 39db9f09f8068e..8bde0d4416fd55 100644
> --- a/include/uapi/fwctl/fwctl.h
> +++ b/include/uapi/fwctl/fwctl.h
> @@ -67,4 +67,70 @@ struct fwctl_info {
> };
> #define FWCTL_INFO _IO(FWCTL_TYPE, FWCTL_CMD_INFO)
>
> +/**
> + * enum fwctl_rpc_scope - Scope of access for the RPC
> + */
> +enum fwctl_rpc_scope {
...
> + /**
> + * @FWCTL_RPC_DEBUG_READ_ONLY: Read only access to debug information
> + *
> + * Readable debug information. Debug information is compatible with
> + * kernel lockdown, and does not disclose any sensitive information. For
> + * instance exposing any encryption secrets from this information is
> + * forbidden.
> + */
> + FWCTL_RPC_DEBUG_READ_ONLY = 1,
> + /**
> + * @FWCTL_RPC_DEBUG_WRITE: Writable access to lockdown compatible debug information
Write access
probably rather than writeable.
> + *
> + * Allows write access to data in the device which may leave a fully
> + * supported state. This is intended to permit intensive and possibly
> + * invasive debugging. This scope will taint the kernel.
> + */
> +};
^ permalink raw reply [flat|nested] 48+ messages in thread* Re: [PATCH v2 5/8] fwctl: FWCTL_RPC to execute a Remote Procedure Call to device firmware
2024-07-26 15:30 ` Jonathan Cameron
@ 2024-07-29 16:28 ` Jason Gunthorpe
0 siblings, 0 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2024-07-29 16:28 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Fri, Jul 26, 2024 at 04:30:09PM +0100, Jonathan Cameron wrote:
> > @@ -8,16 +8,20 @@
> > #include <linux/slab.h>
> > #include <linux/container_of.h>
> > #include <linux/fs.h>
> > +#include <linux/sizes.h>
> >
> > #include <uapi/fwctl/fwctl.h>
> >
> > enum {
> > FWCTL_MAX_DEVICES = 256,
> > + MAX_RPC_LEN = SZ_2M,
> > };
>
> In what way is that usefully handled as an enum?
> I'd just use #defines
I generally am not so keen on defines for constants.. There is some
advantage with clangd and gdb, for instance. Enum is the only other
option even though it is a bit of abuse to use it like this.
> > DEFINE_FREE(kfree_errptr, void *, if (!IS_ERR_OR_NULL(_T)) kfree(_T));
> > +DEFINE_FREE(kvfree_errptr, void *, if (!IS_ERR_OR_NULL(_T)) kvfree(_T));
> kvfree define free already defined as this since 6.9
Ok
> > + void *inbuf __free(kvfree) =
> > + kvzalloc(cmd->in_len, GFP_KERNEL | GFP_KERNEL_ACCOUNT);
>
> As before
> #define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)
> so don't need both.
Yep
> > + if (outbuf == inbuf) {
> > + /* The driver can re-use inbuf as outbuf */
> > + inbuf = NULL;
> I wish no_free_ptr() didn't have __must_check. Can do something ugly
> like
> outbuf = no_free_ptr(inbuf);
> probably but maybe just setting it NULL is simpler.
Yeah NULL seems clearer, the outbuf assignment is a bit too odd, IMHO
> > + /**
> > + * @FWCTL_RPC_DEBUG_READ_ONLY: Read only access to debug information
> > + *
> > + * Readable debug information. Debug information is compatible with
> > + * kernel lockdown, and does not disclose any sensitive information. For
> > + * instance exposing any encryption secrets from this information is
> > + * forbidden.
> > + */
> > + FWCTL_RPC_DEBUG_READ_ONLY = 1,
> > + /**
> > + * @FWCTL_RPC_DEBUG_WRITE: Writable access to lockdown compatible debug information
>
> Write access
> probably rather than writeable.
Sure
Thanks,
Jason
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 5/8] fwctl: FWCTL_RPC to execute a Remote Procedure Call to device firmware
2024-06-24 22:47 ` [PATCH v2 5/8] fwctl: FWCTL_RPC to execute a Remote Procedure Call to device firmware Jason Gunthorpe
2024-07-26 15:30 ` Jonathan Cameron
@ 2024-07-30 8:00 ` Leon Romanovsky
2024-08-01 12:58 ` Jason Gunthorpe
2024-08-07 7:44 ` Oded Gabbay
2 siblings, 1 reply; 48+ messages in thread
From: Leon Romanovsky @ 2024-07-30 8:00 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, linux-doc,
linux-rdma, netdev, Paolo Abeni, Saeed Mahameed, Tariq Toukan,
Andy Gospodarek, Aron Silverton, Dan Williams, David Ahern,
Christoph Hellwig, Jiri Pirko, Leonid Bloch, linux-cxl, patches
On Mon, Jun 24, 2024 at 07:47:29PM -0300, Jason Gunthorpe wrote:
> Add the FWCTL_RPC ioctl which allows a request/response RPC call to device
> firmware. Drivers implementing this call must follow the security
> guidelines under Documentation/userspace-api/fwctl.rst
>
> The core code provides some memory management helpers to get the messages
> copied from and back to userspace. The driver is responsible for
> allocating the output message memory and delivering the message to the
> device.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> drivers/fwctl/main.c | 62 +++++++++++++++++++++++++++++++++++
> include/linux/fwctl.h | 5 +++
> include/uapi/fwctl/fwctl.h | 66 ++++++++++++++++++++++++++++++++++++++
> 3 files changed, 133 insertions(+)
>
> diff --git a/drivers/fwctl/main.c b/drivers/fwctl/main.c
> index f1dec0b590aee4..9506b993a1a56d 100644
> --- a/drivers/fwctl/main.c
> +++ b/drivers/fwctl/main.c
> @@ -8,16 +8,20 @@
> #include <linux/slab.h>
> #include <linux/container_of.h>
> #include <linux/fs.h>
> +#include <linux/sizes.h>
>
> #include <uapi/fwctl/fwctl.h>
>
> enum {
> FWCTL_MAX_DEVICES = 256,
> + MAX_RPC_LEN = SZ_2M,
> };
> static dev_t fwctl_dev;
> static DEFINE_IDA(fwctl_ida);
> +static unsigned long fwctl_tainted;
>
> DEFINE_FREE(kfree_errptr, void *, if (!IS_ERR_OR_NULL(_T)) kfree(_T));
> +DEFINE_FREE(kvfree_errptr, void *, if (!IS_ERR_OR_NULL(_T)) kvfree(_T));
>
> struct fwctl_ucmd {
> struct fwctl_uctx *uctx;
> @@ -75,9 +79,66 @@ static int fwctl_cmd_info(struct fwctl_ucmd *ucmd)
> return ucmd_respond(ucmd, sizeof(*cmd));
> }
>
> +static int fwctl_cmd_rpc(struct fwctl_ucmd *ucmd)
> +{
> + struct fwctl_device *fwctl = ucmd->uctx->fwctl;
> + struct fwctl_rpc *cmd = ucmd->cmd;
> + size_t out_len;
> +
> + if (cmd->in_len > MAX_RPC_LEN || cmd->out_len > MAX_RPC_LEN)
> + return -EMSGSIZE;
> +
> + switch (cmd->scope) {
> + case FWCTL_RPC_CONFIGURATION:
> + case FWCTL_RPC_DEBUG_READ_ONLY:
> + break;
> +
> + case FWCTL_RPC_DEBUG_WRITE_FULL:
> + if (!capable(CAP_SYS_RAWIO))
> + return -EPERM;
> + fallthrough;
> + case FWCTL_RPC_DEBUG_WRITE:
> + if (!test_and_set_bit(0, &fwctl_tainted)) {
> + dev_warn(
> + &fwctl->dev,
> + "%s(%d): has requested full access to the physical device device",
> + current->comm, task_pid_nr(current));
> + add_taint(TAINT_FWCTL, LOCKDEP_STILL_OK);
> + }
> + break;
> + default:
> + return -EOPNOTSUPP;
> + };
> +
> + void *inbuf __free(kvfree) =
> + kvzalloc(cmd->in_len, GFP_KERNEL | GFP_KERNEL_ACCOUNT);
<...>
> + out_len = cmd->out_len;
> + void *outbuf __free(kvfree_errptr) = fwctl->ops->fw_rpc(
> + ucmd->uctx, cmd->scope, inbuf, cmd->in_len, &out_len);
I was under impression that declaration of variables in C should be at the beginning
of block. Was it changed for the kernel?
Thanks
^ permalink raw reply [flat|nested] 48+ messages in thread* Re: [PATCH v2 5/8] fwctl: FWCTL_RPC to execute a Remote Procedure Call to device firmware
2024-07-30 8:00 ` Leon Romanovsky
@ 2024-08-01 12:58 ` Jason Gunthorpe
2024-08-01 17:26 ` Leon Romanovsky
0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2024-08-01 12:58 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, linux-doc,
linux-rdma, netdev, Paolo Abeni, Saeed Mahameed, Tariq Toukan,
Andy Gospodarek, Aron Silverton, Dan Williams, David Ahern,
Christoph Hellwig, Jiri Pirko, Leonid Bloch, linux-cxl, patches
On Tue, Jul 30, 2024 at 11:00:38AM +0300, Leon Romanovsky wrote:
> > +
> > + void *inbuf __free(kvfree) =
> > + kvzalloc(cmd->in_len, GFP_KERNEL | GFP_KERNEL_ACCOUNT);
>
>
> <...>
>
> > + out_len = cmd->out_len;
> > + void *outbuf __free(kvfree_errptr) = fwctl->ops->fw_rpc(
> > + ucmd->uctx, cmd->scope, inbuf, cmd->in_len, &out_len);
>
> I was under impression that declaration of variables in C should be at the beginning
> of block. Was it changed for the kernel?
Yes, the compiler check blocking variables in the body was disabled to
allow cleanup.h
Jonathan said this is the agreed coding style to use for this
Jason
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 5/8] fwctl: FWCTL_RPC to execute a Remote Procedure Call to device firmware
2024-08-01 12:58 ` Jason Gunthorpe
@ 2024-08-01 17:26 ` Leon Romanovsky
2024-08-02 13:59 ` Jonathan Cameron
0 siblings, 1 reply; 48+ messages in thread
From: Leon Romanovsky @ 2024-08-01 17:26 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, linux-doc,
linux-rdma, netdev, Paolo Abeni, Saeed Mahameed, Tariq Toukan,
Andy Gospodarek, Aron Silverton, Dan Williams, David Ahern,
Christoph Hellwig, Jiri Pirko, Leonid Bloch, linux-cxl, patches
On Thu, Aug 01, 2024 at 09:58:29AM -0300, Jason Gunthorpe wrote:
> On Tue, Jul 30, 2024 at 11:00:38AM +0300, Leon Romanovsky wrote:
> > > +
> > > + void *inbuf __free(kvfree) =
> > > + kvzalloc(cmd->in_len, GFP_KERNEL | GFP_KERNEL_ACCOUNT);
> >
> >
> > <...>
> >
> > > + out_len = cmd->out_len;
> > > + void *outbuf __free(kvfree_errptr) = fwctl->ops->fw_rpc(
> > > + ucmd->uctx, cmd->scope, inbuf, cmd->in_len, &out_len);
> >
> > I was under impression that declaration of variables in C should be at the beginning
> > of block. Was it changed for the kernel?
>
> Yes, the compiler check blocking variables in the body was disabled to
> allow cleanup.h
>
> Jonathan said this is the agreed coding style to use for this
I'm said to hear that.
Thanks
>
> Jason
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 5/8] fwctl: FWCTL_RPC to execute a Remote Procedure Call to device firmware
2024-08-01 17:26 ` Leon Romanovsky
@ 2024-08-02 13:59 ` Jonathan Cameron
2024-08-02 15:57 ` Leon Romanovsky
0 siblings, 1 reply; 48+ messages in thread
From: Jonathan Cameron @ 2024-08-02 13:59 UTC (permalink / raw)
To: Leon Romanovsky
Cc: Jason Gunthorpe, Jonathan Corbet, Itay Avraham, Jakub Kicinski,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
linux-cxl, patches, Jonathan Corbet
On Thu, 1 Aug 2024 20:26:31 +0300
Leon Romanovsky <leon@kernel.org> wrote:
> On Thu, Aug 01, 2024 at 09:58:29AM -0300, Jason Gunthorpe wrote:
> > On Tue, Jul 30, 2024 at 11:00:38AM +0300, Leon Romanovsky wrote:
> > > > +
> > > > + void *inbuf __free(kvfree) =
> > > > + kvzalloc(cmd->in_len, GFP_KERNEL | GFP_KERNEL_ACCOUNT);
> > >
> > >
> > > <...>
> > >
> > > > + out_len = cmd->out_len;
> > > > + void *outbuf __free(kvfree_errptr) = fwctl->ops->fw_rpc(
> > > > + ucmd->uctx, cmd->scope, inbuf, cmd->in_len, &out_len);
> > >
> > > I was under impression that declaration of variables in C should be at the beginning
> > > of block. Was it changed for the kernel?
> >
> > Yes, the compiler check blocking variables in the body was disabled to
> > allow cleanup.h
> >
> > Jonathan said this is the agreed coding style to use for this
>
> I'm said to hear that.
Was passing on a statement Linus made (not digging it out right now)
that he really wanted to be able see constructors and destructors
together.
The other part is that in some cases you can end up with non
obvious ordering bugs because the cleanup is the reverse of the
declarations, not the constructors being called.
Whilst it is fairly easy to review for this, future code reorganization
may well lead to subtle bugs, typically in error paths etc.
Putting the declaration inline avoids this potential problem
Dan wrote a style guide proposal.
https://lore.kernel.org/all/171175585714.2192972.12661675876300167762.stgit@dwillia2-xfh.jf.intel.com/
[PATCH v3] cleanup: Add usage and style documentation
seems it died out without anyone applying it. I've poked.
Jonathan
>
> Thanks
>
> >
> > Jason
>
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 5/8] fwctl: FWCTL_RPC to execute a Remote Procedure Call to device firmware
2024-08-02 13:59 ` Jonathan Cameron
@ 2024-08-02 15:57 ` Leon Romanovsky
0 siblings, 0 replies; 48+ messages in thread
From: Leon Romanovsky @ 2024-08-02 15:57 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Jason Gunthorpe, Jonathan Corbet, Itay Avraham, Jakub Kicinski,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
linux-cxl, patches
On Fri, Aug 02, 2024 at 02:59:46PM +0100, Jonathan Cameron wrote:
> On Thu, 1 Aug 2024 20:26:31 +0300
> Leon Romanovsky <leon@kernel.org> wrote:
>
> > On Thu, Aug 01, 2024 at 09:58:29AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Jul 30, 2024 at 11:00:38AM +0300, Leon Romanovsky wrote:
> > > > > +
> > > > > + void *inbuf __free(kvfree) =
> > > > > + kvzalloc(cmd->in_len, GFP_KERNEL | GFP_KERNEL_ACCOUNT);
> > > >
> > > >
> > > > <...>
> > > >
> > > > > + out_len = cmd->out_len;
> > > > > + void *outbuf __free(kvfree_errptr) = fwctl->ops->fw_rpc(
> > > > > + ucmd->uctx, cmd->scope, inbuf, cmd->in_len, &out_len);
> > > >
> > > > I was under impression that declaration of variables in C should be at the beginning
> > > > of block. Was it changed for the kernel?
> > >
> > > Yes, the compiler check blocking variables in the body was disabled to
> > > allow cleanup.h
> > >
> > > Jonathan said this is the agreed coding style to use for this
> >
> > I'm said to hear that.
>
> Was passing on a statement Linus made (not digging it out right now)
> that he really wanted to be able see constructors and destructors
> together.
The thing is that we are talking about the same thing. I and Linus want
to keep locality of variables declaration and initialization. I don't
know the Linus's stance on it, but I'm sad that to achieve that for
cleanup.h, very useful feature of GCC (keep variables at the beginning
of the block) was disabled.
Right now, you can declare variables in any place and it is harder to
review the code now. It is a matter of time when we will see code like
this and start to chase bugs introduced by this pattern:
int f()
{
<some code>
int i;
<some code>
return something;
}
Thanks
>
> The other part is that in some cases you can end up with non
> obvious ordering bugs because the cleanup is the reverse of the
> declarations, not the constructors being called.
> Whilst it is fairly easy to review for this, future code reorganization
> may well lead to subtle bugs, typically in error paths etc.
>
> Putting the declaration inline avoids this potential problem
>
> Dan wrote a style guide proposal.
> https://lore.kernel.org/all/171175585714.2192972.12661675876300167762.stgit@dwillia2-xfh.jf.intel.com/
> [PATCH v3] cleanup: Add usage and style documentation
>
> seems it died out without anyone applying it. I've poked.
>
> Jonathan
>
> >
> > Thanks
> >
> > >
> > > Jason
> >
>
>
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 5/8] fwctl: FWCTL_RPC to execute a Remote Procedure Call to device firmware
2024-06-24 22:47 ` [PATCH v2 5/8] fwctl: FWCTL_RPC to execute a Remote Procedure Call to device firmware Jason Gunthorpe
2024-07-26 15:30 ` Jonathan Cameron
2024-07-30 8:00 ` Leon Romanovsky
@ 2024-08-07 7:44 ` Oded Gabbay
2024-08-08 11:46 ` Jason Gunthorpe
2 siblings, 1 reply; 48+ messages in thread
From: Oded Gabbay @ 2024-08-07 7:44 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Mon, Jun 24, 2024 at 07:47:29PM -0300, Jason Gunthorpe wrote:
> Add the FWCTL_RPC ioctl which allows a request/response RPC call to device
> firmware. Drivers implementing this call must follow the security
> guidelines under Documentation/userspace-api/fwctl.rst
>
> The core code provides some memory management helpers to get the messages
> copied from and back to userspace. The driver is responsible for
> allocating the output message memory and delivering the message to the
> device.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> drivers/fwctl/main.c | 62 +++++++++++++++++++++++++++++++++++
> include/linux/fwctl.h | 5 +++
> include/uapi/fwctl/fwctl.h | 66 ++++++++++++++++++++++++++++++++++++++
> 3 files changed, 133 insertions(+)
>
> diff --git a/drivers/fwctl/main.c b/drivers/fwctl/main.c
> index f1dec0b590aee4..9506b993a1a56d 100644
> --- a/drivers/fwctl/main.c
> +++ b/drivers/fwctl/main.c
> @@ -8,16 +8,20 @@
> #include <linux/slab.h>
> #include <linux/container_of.h>
> #include <linux/fs.h>
> +#include <linux/sizes.h>
>
> #include <uapi/fwctl/fwctl.h>
>
> enum {
> FWCTL_MAX_DEVICES = 256,
> + MAX_RPC_LEN = SZ_2M,
> };
> static dev_t fwctl_dev;
> static DEFINE_IDA(fwctl_ida);
> +static unsigned long fwctl_tainted;
>
> DEFINE_FREE(kfree_errptr, void *, if (!IS_ERR_OR_NULL(_T)) kfree(_T));
> +DEFINE_FREE(kvfree_errptr, void *, if (!IS_ERR_OR_NULL(_T)) kvfree(_T));
>
> struct fwctl_ucmd {
> struct fwctl_uctx *uctx;
> @@ -75,9 +79,66 @@ static int fwctl_cmd_info(struct fwctl_ucmd *ucmd)
> return ucmd_respond(ucmd, sizeof(*cmd));
> }
>
> +static int fwctl_cmd_rpc(struct fwctl_ucmd *ucmd)
> +{
> + struct fwctl_device *fwctl = ucmd->uctx->fwctl;
> + struct fwctl_rpc *cmd = ucmd->cmd;
> + size_t out_len;
> +
> + if (cmd->in_len > MAX_RPC_LEN || cmd->out_len > MAX_RPC_LEN)
> + return -EMSGSIZE;
> +
> + switch (cmd->scope) {
> + case FWCTL_RPC_CONFIGURATION:
> + case FWCTL_RPC_DEBUG_READ_ONLY:
> + break;
> +
> + case FWCTL_RPC_DEBUG_WRITE_FULL:
> + if (!capable(CAP_SYS_RAWIO))
> + return -EPERM;
> + fallthrough;
> + case FWCTL_RPC_DEBUG_WRITE:
> + if (!test_and_set_bit(0, &fwctl_tainted)) {
> + dev_warn(
> + &fwctl->dev,
> + "%s(%d): has requested full access to the physical device device",
> + current->comm, task_pid_nr(current));
> + add_taint(TAINT_FWCTL, LOCKDEP_STILL_OK);
> + }
> + break;
> + default:
> + return -EOPNOTSUPP;
> + };
> +
> + void *inbuf __free(kvfree) =
> + kvzalloc(cmd->in_len, GFP_KERNEL | GFP_KERNEL_ACCOUNT);
> + if (!inbuf)
> + return -ENOMEM;
> + if (copy_from_user(inbuf, u64_to_user_ptr(cmd->in), cmd->in_len))
> + return -EFAULT;
> +
> + out_len = cmd->out_len;
> + void *outbuf __free(kvfree_errptr) = fwctl->ops->fw_rpc(
> + ucmd->uctx, cmd->scope, inbuf, cmd->in_len, &out_len);
Hi Jason,
First of all, great work. I fully support this direction. Although I'm not
working anymore in Habana, I would have definitely moved to this interface
instead of the custom one we implemented in our driver. I believe this can
be of use for other accel drivers as well.
Complex devices which contains multiple IPs and FWs, like Habana's Gaudi,
have some RPCs to the firmware which are not related to the funcationality
of the IP drivers (the compute and networking drivers in our case).
Spreading the RPCs between the drivers required to separate it also among
the different user-spaces libraries, as each subsystem has its own user-space.
Disconnecting the RPCs from the drivers and providing a generic interface will
allow to have a single user-space library which will be able to communicate
with the firmware for all the IPs in the device. In Habana's case, it was
mainly for monitoring and debugging purposes.
I do have one question about the rpc ioctl. Are you assuming that the rpc
is synchronous (we send a message to the firmware and block until we get the
reply)? If so, what happen if we have an async RPC implementation
inside the driver? How would you recommend to handle it?
Thanks,
Oded
> + if (IS_ERR(outbuf))
> + return PTR_ERR(outbuf);
> + if (outbuf == inbuf) {
> + /* The driver can re-use inbuf as outbuf */
> + inbuf = NULL;
> + }
> +
> + if (copy_to_user(u64_to_user_ptr(cmd->out), outbuf,
> + min(cmd->out_len, out_len)))
> + return -EFAULT;
> +
> + cmd->out_len = out_len;
> + return ucmd_respond(ucmd, sizeof(*cmd));
> +}
> +
> /* On stack memory for the ioctl structs */
> union ucmd_buffer {
> struct fwctl_info info;
> + struct fwctl_rpc rpc;
> };
>
> struct fwctl_ioctl_op {
> @@ -98,6 +159,7 @@ struct fwctl_ioctl_op {
> }
> static const struct fwctl_ioctl_op fwctl_ioctl_ops[] = {
> IOCTL_OP(FWCTL_INFO, fwctl_cmd_info, struct fwctl_info, out_device_data),
> + IOCTL_OP(FWCTL_RPC, fwctl_cmd_rpc, struct fwctl_rpc, out),
> };
>
> static long fwctl_fops_ioctl(struct file *filp, unsigned int cmd,
> diff --git a/include/linux/fwctl.h b/include/linux/fwctl.h
> index 9a906b861acf3a..294cfbf63306a2 100644
> --- a/include/linux/fwctl.h
> +++ b/include/linux/fwctl.h
> @@ -26,6 +26,9 @@ struct fwctl_uctx;
> * out_device_data. On input length indicates the size of the user buffer
> * on output it indicates the size of the memory. The driver can ignore
> * length on input, the core code will handle everything.
> + * @fw_rpc: Implement FWCTL_RPC. Deliver rpc_in/in_len to the FW and return
> + * the response and set out_len. rpc_in can be returned as the response
> + * pointer. Otherwise the returned pointer is freed with kvfree().
> */
> struct fwctl_ops {
> enum fwctl_device_type device_type;
> @@ -33,6 +36,8 @@ struct fwctl_ops {
> int (*open_uctx)(struct fwctl_uctx *uctx);
> void (*close_uctx)(struct fwctl_uctx *uctx);
> void *(*info)(struct fwctl_uctx *uctx, size_t *length);
> + void *(*fw_rpc)(struct fwctl_uctx *uctx, enum fwctl_rpc_scope scope,
> + void *rpc_in, size_t in_len, size_t *out_len);
> };
>
> /**
> diff --git a/include/uapi/fwctl/fwctl.h b/include/uapi/fwctl/fwctl.h
> index 39db9f09f8068e..8bde0d4416fd55 100644
> --- a/include/uapi/fwctl/fwctl.h
> +++ b/include/uapi/fwctl/fwctl.h
> @@ -67,4 +67,70 @@ struct fwctl_info {
> };
> #define FWCTL_INFO _IO(FWCTL_TYPE, FWCTL_CMD_INFO)
>
> +/**
> + * enum fwctl_rpc_scope - Scope of access for the RPC
> + */
> +enum fwctl_rpc_scope {
> + /**
> + * @FWCTL_RPC_CONFIGURATION: Device configuration access scope
> + *
> + * Read/write access to device configuration. When configuration
> + * is written to the device remains in a fully supported state.
> + */
> + FWCTL_RPC_CONFIGURATION = 0,
> + /**
> + * @FWCTL_RPC_DEBUG_READ_ONLY: Read only access to debug information
> + *
> + * Readable debug information. Debug information is compatible with
> + * kernel lockdown, and does not disclose any sensitive information. For
> + * instance exposing any encryption secrets from this information is
> + * forbidden.
> + */
> + FWCTL_RPC_DEBUG_READ_ONLY = 1,
> + /**
> + * @FWCTL_RPC_DEBUG_WRITE: Writable access to lockdown compatible debug information
> + *
> + * Allows write access to data in the device which may leave a fully
> + * supported state. This is intended to permit intensive and possibly
> + * invasive debugging. This scope will taint the kernel.
> + */
> + FWCTL_RPC_DEBUG_WRITE = 2,
> + /**
> + * @FWCTL_RPC_DEBUG_WRITE_FULL: Writable access to all debug information
> + *
> + * Allows read/write access to everything. Requires CAP_SYS_RAW_IO, so
> + * it is not required to follow lockdown principals. If in doubt
> + * debugging should be placed in this scope. This scope will taint the
> + * kernel.
> + */
> + FWCTL_RPC_DEBUG_WRITE_FULL = 3,
> +};
> +
> +/**
> + * struct fwctl_rpc - ioctl(FWCTL_RPC)
> + * @size: sizeof(struct fwctl_rpc)
> + * @scope: One of enum fwctl_rpc_scope, required scope for the RPC
> + * @in_len: Length of the in memory
> + * @out_len: Length of the out memory
> + * @in: Request message in device specific format
> + * @out: Response message in device specific format
> + *
> + * Deliver a Remote Procedure Call to the device FW and return the response. The
> + * call's parameters and return are marshaled into linear buffers of memory. Any
> + * errno indicates that delivery of the RPC to the device failed. Return status
> + * originating in the device during a successful delivery must be encoded into
> + * out.
> + *
> + * The format of the buffers matches the out_device_type from FWCTL_INFO.
> + */
> +struct fwctl_rpc {
> + __u32 size;
> + __u32 scope;
> + __u32 in_len;
> + __u32 out_len;
> + __aligned_u64 in;
> + __aligned_u64 out;
> +};
> +#define FWCTL_RPC _IO(FWCTL_TYPE, FWCTL_CMD_RPC)
> +
> #endif
> --
> 2.45.2
>
>
^ permalink raw reply [flat|nested] 48+ messages in thread* Re: [PATCH v2 5/8] fwctl: FWCTL_RPC to execute a Remote Procedure Call to device firmware
2024-08-07 7:44 ` Oded Gabbay
@ 2024-08-08 11:46 ` Jason Gunthorpe
0 siblings, 0 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2024-08-08 11:46 UTC (permalink / raw)
To: Oded Gabbay
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Wed, Aug 07, 2024 at 10:44:21AM +0300, Oded Gabbay wrote:
> Disconnecting the RPCs from the drivers and providing a generic interface will
> allow to have a single user-space library which will be able to communicate
> with the firmware for all the IPs in the device. In Habana's case, it was
> mainly for monitoring and debugging purposes.
Yeah, monitoring and debugging is definately a key use case.
> I do have one question about the rpc ioctl. Are you assuming that the rpc
> is synchronous (we send a message to the firmware and block until we get the
> reply)? If so, what happen if we have an async RPC implementation
> inside the driver? How would you recommend to handle it?
Yes, this is all simplified so the ioctl system call is
synchronous. mlx5 is async under the hood too, it just launches an
async work and waits on a completion before returning from the system
call.
Userspace could multi-thread it, and If people were really keen maybe
there is some io_uring approach.
Thanks,
Jason
^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH v2 6/8] fwctl: Add documentation
2024-06-24 22:47 [PATCH v2 0/8] Introduce fwctl subystem Jason Gunthorpe
` (4 preceding siblings ...)
2024-06-24 22:47 ` [PATCH v2 5/8] fwctl: FWCTL_RPC to execute a Remote Procedure Call to device firmware Jason Gunthorpe
@ 2024-06-24 22:47 ` Jason Gunthorpe
2024-06-25 22:04 ` Randy Dunlap
` (2 more replies)
2024-06-24 22:47 ` [PATCH v2 7/8] fwctl/mlx5: Support for communicating with mlx5 fw Jason Gunthorpe
` (2 subsequent siblings)
8 siblings, 3 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2024-06-24 22:47 UTC (permalink / raw)
To: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan
Cc: Andy Gospodarek, Aron Silverton, Dan Williams, David Ahern,
Christoph Hellwig, Jiri Pirko, Leonid Bloch, Leon Romanovsky,
linux-cxl, patches
Document the purpose and rules for the fwctl subsystem.
Link in kdocs to the doc tree.
Nacked-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20240603114250.5325279c@kernel.org
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
Documentation/userspace-api/fwctl.rst | 269 ++++++++++++++++++++++++++
Documentation/userspace-api/index.rst | 1 +
2 files changed, 270 insertions(+)
create mode 100644 Documentation/userspace-api/fwctl.rst
diff --git a/Documentation/userspace-api/fwctl.rst b/Documentation/userspace-api/fwctl.rst
new file mode 100644
index 00000000000000..ece2db2530502f
--- /dev/null
+++ b/Documentation/userspace-api/fwctl.rst
@@ -0,0 +1,269 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===============
+fwctl subsystem
+===============
+
+:Author: Jason Gunthorpe
+
+Overview
+========
+
+Modern devices contain extensive amounts of FW, and in many cases, are largely
+software-defined pieces of hardware. The evolution of this approach is largely a
+reaction to Moore's Law where a chip tape out is now highly expensive, and the
+chip design is extremely large. Replacing fixed HW logic with a flexible and
+tightly coupled FW/HW combination is an effective risk mitigation against chip
+respin. Problems in the HW design can be counteracted in device FW. This is
+especially true for devices which present a stable and backwards compatible
+interface to the operating system driver (such as NVMe).
+
+The FW layer in devices has grown to incredible sizes and devices frequently
+integrate clusters of fast processors to run it. For example, mlx5 devices have
+over 30MB of FW code, and big configurations operate with over 1GB of FW managed
+runtime state.
+
+The availability of such a flexible layer has created quite a variety in the
+industry where single pieces of silicon are now configurable software-defined
+devices and can operate in substantially different ways depending on the need.
+Further, we often see cases where specific sites wish to operate devices in ways
+that are highly specialized and require applications that have been tailored to
+their unique configuration.
+
+Further, devices have become multi-functional and integrated to the point they
+no longer fit neatly into the kernel's division of subsystems. Modern
+multi-functional devices have drivers, such as bnxt/ice/mlx5/pds, that span many
+subsystems while sharing the underlying hardware using the auxiliary device
+system.
+
+All together this creates a challenge for the operating system, where devices
+have an expansive FW environment that needs robust device-specific debugging
+support, and FW-driven functionality that is not well suited to “generic”
+interfaces. fwctl seeks to allow access to the full device functionality from
+user space in the areas of debuggability, management, and first-boot/nth-boot
+provisioning.
+
+fwctl is aimed at the common device design pattern where the OS and FW
+communicate via an RPC message layer constructed with a queue or mailbox scheme.
+In this case the driver will typically have some layer to deliver RPC messages
+and collect RPC responses from device FW. The in-kernel subsystem drivers that
+operate the device for its primary purposes will use these RPCs to build their
+drivers, but devices also usually have a set of ancillary RPCs that don't really
+fit into any specific subsystem. For example, a HW RAID controller is primarily
+operated by the block layer but also comes with a set of RPCs to administer the
+construction of drives within the HW RAID.
+
+In the past when devices were more single function, individual subsystems would
+grow different approaches to solving some of these common problems. For instance
+monitoring device health, manipulating its FLASH, debugging the FW,
+provisioning, all have various unique interfaces across the kernel.
+
+fwctl's purpose is to define a common set of limited rules, described below,
+that allow user space to securely construct and execute RPCs inside device FW.
+The rules serve as an agreement between the operating system and FW on how to
+correctly design the RPC interface. As a uAPI the subsystem provides a thin
+layer of discovery and a generic uAPI to deliver the RPCs and collect the
+response. It supports a system of user space libraries and tools which will
+use this interface to control the device using the device native protocols.
+
+Scope of Action
+---------------
+
+fwctl drivers are strictly restricted to being a way to operate the device FW.
+It is not an avenue to access random kernel internals, or other operating system
+SW states.
+
+fwctl instances must operate on a well-defined device function, and the device
+should have a well-defined security model for what scope within the physical
+device the function is permitted to access. For instance, the most complex PCIe
+device today may broadly have several function-level scopes:
+
+ 1. A privileged function with full access to the on-device global state and
+ configuration
+
+ 2. Multiple hypervisor functions with control over itself and child functions
+ used with VMs
+
+ 3. Multiple VM functions tightly scoped within the VM
+
+The device may create a logical parent/child relationship between these scopes.
+For instance a child VM's FW may be within the scope of the hypervisor FW. It is
+quite common in the VFIO world that the hypervisor environment has a complex
+provisioning/profiling/configuration responsibility for the function VFIO
+assigns to the VM.
+
+Further, within the function, devices often have RPC commands that fall within
+some general scopes of action:
+
+ 1. Access to function & child configuration, FLASH, etc/ that becomes live at a
+ function reset.
+
+ 2. Access to function & child runtime configuration that kernel drivers can
+ discover at runtime.
+
+ 3. Read only access to function debug information that may report on FW objects
+ in the function & child, including FW objects owned by other kernel
+ subsystems.
+
+ 4. Write access to function & child debug information strictly compatible with
+ the principles of kernel lockdown and kernel integrity protection. Triggers
+ a kernel Taint.
+
+ 5. Full debug device access. Triggers a kernel Taint, requires CAP_SYS_RAWIO.
+
+Userspace will provide a scope label on each RPC and the kernel must enforce the
+above CAP's and taints based on that scope. A combination of kernel and FW can
+enforce that RPCs are placed in the correct scope by userspace.
+
+Denied behavior
+---------------
+
+There are many things this interface must not allow user space to do (without a
+Taint or CAP), broadly derived from the principles of kernel lockdown. Some
+examples:
+
+ 1. DMA to/from arbitrary memory, hang the system, run code in the device, or
+ otherwise compromise device or system security and integrity.
+
+ 2. Provide an abnormal “back door” to kernel drivers. No manipulation of kernel
+ objects owned by kernel drivers.
+
+ 3. Directly configure or otherwise control kernel drivers. A subsystem kernel
+ driver can react to the device configuration at function reset/driver load
+ time, but otherwise should not be coupled to fwctl.
+
+ 4. Operate the HW in a way that overlaps with the core purpose of another
+ primary kernel subsystem, such as read/write to LBAs, send/receive of
+ network packets, or operate an accelerator's data plane.
+
+fwctl is not a replacement for device direct access subsystems like uacce or
+VFIO.
+
+fwctl User API
+==============
+
+.. kernel-doc:: include/uapi/fwctl/fwctl.h
+.. kernel-doc:: include/uapi/fwctl/mlx5.h
+
+sysfs Class
+-----------
+
+fwctl has a sysfs class (/sys/class/fwctl/fwctlNN/) and character devices
+(/dev/fwctl/fwctlNN) with a simple numbered scheme. The character device
+operates the iotcl uAPI described above.
+
+fwctl devices can be related to driver components in other subsystems through
+sysfs::
+
+ $ ls /sys/class/fwctl/fwctl0/device/infiniband/
+ ibp0s10f0
+
+ $ ls /sys/class/infiniband/ibp0s10f0/device/fwctl/
+ fwctl0/
+
+ $ ls /sys/devices/pci0000:00/0000:00:0a.0/fwctl/fwctl0
+ dev device power subsystem uevent
+
+User space Community
+--------------------
+
+Drawing inspiration from nvme-cli, participating in the kernel side must come
+with a user space in a common TBD git tree, at a minimum to usefully operate the
+kernel driver. Providing such an implementation is a pre-condition to merging a
+kernel driver.
+
+The goal is to build user space community around some of the shared problems
+we all have, and ideally develop some common user space programs with some
+starting themes of:
+
+ - Device in-field debugging
+
+ - HW provisioning
+
+ - VFIO child device profiling before VM boot
+
+ - Confidential Compute topics (attestation, secure provisioning)
+
+That stretch across all subsystems in the kernel. fwupd is a great example of
+how an excellent user space experience can emerge out of kernel-side diversity.
+
+fwctl Kernel API
+================
+
+.. kernel-doc:: drivers/fwctl/main.c
+ :export:
+.. kernel-doc:: include/linux/fwctl.h
+
+fwctl Driver design
+-------------------
+
+In many cases a fwctl driver is going to be part of a larger cross-subsystem
+device possibly using the auxiliary_device mechanism. In that case several
+subsystems are going to be sharing the same device and FW interface layer so the
+device design must already provide for isolation and cooperation between kernel
+subsystems. fwctl should fit into that same model.
+
+Part of the driver should include a description of how its scope restrictions
+and security model work. The driver and FW together must ensure that RPCs
+provided by user space are mapped to the appropriate scope. If the validation is
+done in the driver then the validation can read a 'command effects' report from
+the device, or hardwire the enforcement. If the validation is done in the FW,
+then the driver should pass the fwctl_rpc_scope to the FW along with the command.
+
+The driver and FW must cooperate to ensure that either fwctl cannot allocate
+any FW resources, or any resources it does allocate are freed on FD closure. A
+driver primarily constructed around FW RPCs may find that its core PCI function
+and RPC layer belongs under fwctl with auxiliary devices connecting to other
+subsystems.
+
+Each device type must represent a stable FW ABI, such that the userspace
+components have the same general stability we expect from the kernel. FW upgrade
+should not break the userspace tools.
+
+Security Response
+=================
+
+The kernel remains the gatekeeper for this interface. If violations of the
+scopes, security or isolation principles are found, we have options to let
+devices fix them with a FW update, push a kernel patch to parse and block RPC
+commands or push a kernel patch to block entire firmware versions/devices.
+
+While the kernel can always directly parse and restrict RPCs, it is expected
+that the existing kernel pattern of allowing drivers to delegate validation to
+FW to be a useful design.
+
+Existing Similar Examples
+=========================
+
+The approach described in this document is not a new idea. Direct, or near
+direct device access has been offered by the kernel in different areas for
+decades. With more devices wanting to follow this design pattern it is becoming
+clear that it is not entirely well understood and, more importantly, the
+security considerations are not well defined or agreed upon.
+
+Some examples:
+
+ - HW RAID controllers. This includes RPCs to do things like compose drives into
+ a RAID volume, configure RAID parameters, monitor the HW and more.
+
+ - Baseboard managers. RPCs for configuring settings in the device and more
+
+ - NVMe vendor command capsules. nvme-cli provides access to some monitoring
+ functions that different products have defined, but more exists.
+
+ - CXL also has a NVMe-like vendor command system.
+
+ - DRM allows user space drivers to send commands to the device via kernel
+ mediation
+
+ - RDMA allows user space drivers to directly push commands to the device
+ without kernel involvement
+
+ - Various “raw” APIs, raw HID (SDL2), raw USB, NVMe Generic Interface, etc
+
+The first 4 are examples of areas that fwctl intends to cover.
+
+Some key lessons learned from these past efforts are the importance of having a
+common user space project to use as a pre-condition for obtaining a kernel
+driver. Developing good community around useful software in user space is key to
+getting companies to fund participation to enable their products.
diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
index 8a251d71fa6e14..990b4c0710c99e 100644
--- a/Documentation/userspace-api/index.rst
+++ b/Documentation/userspace-api/index.rst
@@ -44,6 +44,7 @@ Devices and I/O
accelerators/ocxl
dma-buf-alloc-exchange
+ fwctl
gpio/index
iommu
iommufd
--
2.45.2
^ permalink raw reply related [flat|nested] 48+ messages in thread* Re: [PATCH v2 6/8] fwctl: Add documentation
2024-06-24 22:47 ` [PATCH v2 6/8] fwctl: Add documentation Jason Gunthorpe
@ 2024-06-25 22:04 ` Randy Dunlap
2024-07-22 16:18 ` Jason Gunthorpe
2024-07-26 15:50 ` Jonathan Cameron
2024-08-06 8:03 ` Daniel Vetter
2 siblings, 1 reply; 48+ messages in thread
From: Randy Dunlap @ 2024-06-25 22:04 UTC (permalink / raw)
To: Jason Gunthorpe, Jonathan Corbet, Itay Avraham, Jakub Kicinski,
Leon Romanovsky, linux-doc, linux-rdma, netdev, Paolo Abeni,
Saeed Mahameed, Tariq Toukan
Cc: Andy Gospodarek, Aron Silverton, Dan Williams, David Ahern,
Christoph Hellwig, Jiri Pirko, Leonid Bloch, Leon Romanovsky,
linux-cxl, patches
On 6/24/24 3:47 PM, Jason Gunthorpe wrote:
> Document the purpose and rules for the fwctl subsystem.
>
> Link in kdocs to the doc tree.
>
> Nacked-by: Jakub Kicinski <kuba@kernel.org>
> Link: https://lore.kernel.org/r/20240603114250.5325279c@kernel.org
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
> Documentation/userspace-api/fwctl.rst | 269 ++++++++++++++++++++++++++
> Documentation/userspace-api/index.rst | 1 +
> 2 files changed, 270 insertions(+)
> create mode 100644 Documentation/userspace-api/fwctl.rst
>
> diff --git a/Documentation/userspace-api/fwctl.rst b/Documentation/userspace-api/fwctl.rst
> new file mode 100644
> index 00000000000000..ece2db2530502f
> --- /dev/null
> +++ b/Documentation/userspace-api/fwctl.rst
> @@ -0,0 +1,269 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +===============
> +fwctl subsystem
> +===============
> +
> +:Author: Jason Gunthorpe
> +
> +Overview
> +========
> +
> +Modern devices contain extensive amounts of FW, and in many cases, are largely
> +software-defined pieces of hardware. The evolution of this approach is largely a
> +reaction to Moore's Law where a chip tape out is now highly expensive, and the
> +chip design is extremely large. Replacing fixed HW logic with a flexible and
> +tightly coupled FW/HW combination is an effective risk mitigation against chip
> +respin. Problems in the HW design can be counteracted in device FW. This is
> +especially true for devices which present a stable and backwards compatible
> +interface to the operating system driver (such as NVMe).
> +
> +The FW layer in devices has grown to incredible sizes and devices frequently
> +integrate clusters of fast processors to run it. For example, mlx5 devices have
> +over 30MB of FW code, and big configurations operate with over 1GB of FW managed
> +runtime state.
> +
> +The availability of such a flexible layer has created quite a variety in the
> +industry where single pieces of silicon are now configurable software-defined
> +devices and can operate in substantially different ways depending on the need.
> +Further, we often see cases where specific sites wish to operate devices in ways
> +that are highly specialized and require applications that have been tailored to
> +their unique configuration.
> +
> +Further, devices have become multi-functional and integrated to the point they
> +no longer fit neatly into the kernel's division of subsystems. Modern
> +multi-functional devices have drivers, such as bnxt/ice/mlx5/pds, that span many
> +subsystems while sharing the underlying hardware using the auxiliary device
> +system.
> +
> +All together this creates a challenge for the operating system, where devices
> +have an expansive FW environment that needs robust device-specific debugging
> +support, and FW-driven functionality that is not well suited to “generic”
> +interfaces. fwctl seeks to allow access to the full device functionality from
> +user space in the areas of debuggability, management, and first-boot/nth-boot
> +provisioning.
> +
> +fwctl is aimed at the common device design pattern where the OS and FW
> +communicate via an RPC message layer constructed with a queue or mailbox scheme.
> +In this case the driver will typically have some layer to deliver RPC messages
> +and collect RPC responses from device FW. The in-kernel subsystem drivers that
> +operate the device for its primary purposes will use these RPCs to build their
> +drivers, but devices also usually have a set of ancillary RPCs that don't really
> +fit into any specific subsystem. For example, a HW RAID controller is primarily
> +operated by the block layer but also comes with a set of RPCs to administer the
> +construction of drives within the HW RAID.
> +
> +In the past when devices were more single function, individual subsystems would
> +grow different approaches to solving some of these common problems. For instance
> +monitoring device health, manipulating its FLASH, debugging the FW,
> +provisioning, all have various unique interfaces across the kernel.
> +
> +fwctl's purpose is to define a common set of limited rules, described below,
> +that allow user space to securely construct and execute RPCs inside device FW.
> +The rules serve as an agreement between the operating system and FW on how to
> +correctly design the RPC interface. As a uAPI the subsystem provides a thin
> +layer of discovery and a generic uAPI to deliver the RPCs and collect the
> +response. It supports a system of user space libraries and tools which will
> +use this interface to control the device using the device native protocols.
> +
> +Scope of Action
> +---------------
> +
> +fwctl drivers are strictly restricted to being a way to operate the device FW.
> +It is not an avenue to access random kernel internals, or other operating system
> +SW states.
> +
> +fwctl instances must operate on a well-defined device function, and the device
> +should have a well-defined security model for what scope within the physical
> +device the function is permitted to access. For instance, the most complex PCIe
> +device today may broadly have several function-level scopes:
> +
> + 1. A privileged function with full access to the on-device global state and
> + configuration
> +
> + 2. Multiple hypervisor functions with control over itself and child functions
> + used with VMs
> +
> + 3. Multiple VM functions tightly scoped within the VM
> +
> +The device may create a logical parent/child relationship between these scopes.
> +For instance a child VM's FW may be within the scope of the hypervisor FW. It is
> +quite common in the VFIO world that the hypervisor environment has a complex
> +provisioning/profiling/configuration responsibility for the function VFIO
> +assigns to the VM.
> +
> +Further, within the function, devices often have RPC commands that fall within
> +some general scopes of action:
> +
> + 1. Access to function & child configuration, FLASH, etc/ that becomes live at a
etc.
> + function reset.
> +
> + 2. Access to function & child runtime configuration that kernel drivers can
> + discover at runtime.
> +
> + 3. Read only access to function debug information that may report on FW objects
Read-only
> + in the function & child, including FW objects owned by other kernel
> + subsystems.
> +
> + 4. Write access to function & child debug information strictly compatible with
> + the principles of kernel lockdown and kernel integrity protection. Triggers
> + a kernel Taint.
> +
> + 5. Full debug device access. Triggers a kernel Taint, requires CAP_SYS_RAWIO.
> +
> +Userspace will provide a scope label on each RPC and the kernel must enforce the
Some places (above/below here) say "user space" instead of "userspace". Please choose one
and stick with it.
> +above CAP's and taints based on that scope. A combination of kernel and FW can
CAPs
> +enforce that RPCs are placed in the correct scope by userspace.
> +
> +Denied behavior
> +---------------
> +
> +There are many things this interface must not allow user space to do (without a
> +Taint or CAP), broadly derived from the principles of kernel lockdown. Some
> +examples:
> +
> + 1. DMA to/from arbitrary memory, hang the system, run code in the device, or
An RPC message is going to run code in the device. Should this say something instead
like:
download [or load] code to be executed in the device,
> + otherwise compromise device or system security and integrity.
> +
> + 2. Provide an abnormal “back door” to kernel drivers. No manipulation of kernel
> + objects owned by kernel drivers.
> +
> + 3. Directly configure or otherwise control kernel drivers. A subsystem kernel
> + driver can react to the device configuration at function reset/driver load
> + time, but otherwise should not be coupled to fwctl.
> +
> + 4. Operate the HW in a way that overlaps with the core purpose of another
> + primary kernel subsystem, such as read/write to LBAs, send/receive of
> + network packets, or operate an accelerator's data plane.
> +
> +fwctl is not a replacement for device direct access subsystems like uacce or
> +VFIO.
> +
> +fwctl User API
> +==============
> +
> +.. kernel-doc:: include/uapi/fwctl/fwctl.h
> +.. kernel-doc:: include/uapi/fwctl/mlx5.h
> +
> +sysfs Class
> +-----------
> +
> +fwctl has a sysfs class (/sys/class/fwctl/fwctlNN/) and character devices
> +(/dev/fwctl/fwctlNN) with a simple numbered scheme. The character device
> +operates the iotcl uAPI described above.
> +
> +fwctl devices can be related to driver components in other subsystems through
> +sysfs::
> +
> + $ ls /sys/class/fwctl/fwctl0/device/infiniband/
> + ibp0s10f0
> +
> + $ ls /sys/class/infiniband/ibp0s10f0/device/fwctl/
> + fwctl0/
> +
> + $ ls /sys/devices/pci0000:00/0000:00:0a.0/fwctl/fwctl0
> + dev device power subsystem uevent
> +
> +User space Community
> +--------------------
> +
> +Drawing inspiration from nvme-cli, participating in the kernel side must come
> +with a user space in a common TBD git tree, at a minimum to usefully operate the
> +kernel driver. Providing such an implementation is a pre-condition to merging a
> +kernel driver.
> +
> +The goal is to build user space community around some of the shared problems
> +we all have, and ideally develop some common user space programs with some
> +starting themes of:
> +
> + - Device in-field debugging
> +
> + - HW provisioning
> +
> + - VFIO child device profiling before VM boot
> +
> + - Confidential Compute topics (attestation, secure provisioning)
> +
> +That stretch across all subsystems in the kernel. fwupd is a great example of
that
> +how an excellent user space experience can emerge out of kernel-side diversity.
> +
> +fwctl Kernel API
> +================
> +
> +.. kernel-doc:: drivers/fwctl/main.c
> + :export:
> +.. kernel-doc:: include/linux/fwctl.h
> +
> +fwctl Driver design
> +-------------------
> +
> +In many cases a fwctl driver is going to be part of a larger cross-subsystem
> +device possibly using the auxiliary_device mechanism. In that case several
> +subsystems are going to be sharing the same device and FW interface layer so the
> +device design must already provide for isolation and cooperation between kernel
> +subsystems. fwctl should fit into that same model.
> +
> +Part of the driver should include a description of how its scope restrictions
> +and security model work. The driver and FW together must ensure that RPCs
> +provided by user space are mapped to the appropriate scope. If the validation is
> +done in the driver then the validation can read a 'command effects' report from
> +the device, or hardwire the enforcement. If the validation is done in the FW,
> +then the driver should pass the fwctl_rpc_scope to the FW along with the command.
> +
> +The driver and FW must cooperate to ensure that either fwctl cannot allocate
> +any FW resources, or any resources it does allocate are freed on FD closure. A
> +driver primarily constructed around FW RPCs may find that its core PCI function
> +and RPC layer belongs under fwctl with auxiliary devices connecting to other
> +subsystems.
> +
> +Each device type must represent a stable FW ABI, such that the userspace
> +components have the same general stability we expect from the kernel. FW upgrade
> +should not break the userspace tools.
> +
> +Security Response
> +=================
> +
> +The kernel remains the gatekeeper for this interface. If violations of the
> +scopes, security or isolation principles are found, we have options to let
> +devices fix them with a FW update, push a kernel patch to parse and block RPC
fwctl does not do FW updates, is that correct?
> +commands or push a kernel patch to block entire firmware versions/devices.
> +
> +While the kernel can always directly parse and restrict RPCs, it is expected
> +that the existing kernel pattern of allowing drivers to delegate validation to
> +FW to be a useful design.
> +
> +Existing Similar Examples
> +=========================
> +
> +The approach described in this document is not a new idea. Direct, or near
> +direct device access has been offered by the kernel in different areas for
> +decades. With more devices wanting to follow this design pattern it is becoming
> +clear that it is not entirely well understood and, more importantly, the
> +security considerations are not well defined or agreed upon.
> +
> +Some examples:
> +
> + - HW RAID controllers. This includes RPCs to do things like compose drives into
> + a RAID volume, configure RAID parameters, monitor the HW and more.
> +
> + - Baseboard managers. RPCs for configuring settings in the device and more
> +
> + - NVMe vendor command capsules. nvme-cli provides access to some monitoring
> + functions that different products have defined, but more exists.
exist.
> +
> + - CXL also has a NVMe-like vendor command system.
> +
> + - DRM allows user space drivers to send commands to the device via kernel
> + mediation
> +
> + - RDMA allows user space drivers to directly push commands to the device
> + without kernel involvement
> +
> + - Various “raw” APIs, raw HID (SDL2), raw USB, NVMe Generic Interface, etc
etc.
> +
> +The first 4 are examples of areas that fwctl intends to cover.
> +
> +Some key lessons learned from these past efforts are the importance of having a
> +common user space project to use as a pre-condition for obtaining a kernel
> +driver. Developing good community around useful software in user space is key to
> +getting companies to fund participation to enable their products.
--
~Randy
^ permalink raw reply [flat|nested] 48+ messages in thread* Re: [PATCH v2 6/8] fwctl: Add documentation
2024-06-25 22:04 ` Randy Dunlap
@ 2024-07-22 16:18 ` Jason Gunthorpe
2024-07-22 20:40 ` Randy Dunlap
0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2024-07-22 16:18 UTC (permalink / raw)
To: Randy Dunlap
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Tue, Jun 25, 2024 at 03:04:42PM -0700, Randy Dunlap wrote:
> > +There are many things this interface must not allow user space to do (without a
> > +Taint or CAP), broadly derived from the principles of kernel lockdown. Some
> > +examples:
> > +
> > + 1. DMA to/from arbitrary memory, hang the system, run code in the device, or
>
> An RPC message is going to run code in the device. Should this say something instead
> like:
>
> download [or load] code to be executed in the device,
Yeah, it is a hard concept. It is kind of murky as even today's
devlink flash will let you load untrusted code into the device under
lockdown AFAICR.
How about:
1. DMA to/from arbitrary memory, hang the system, compromise FW integrity with
untrusted code, or otherwise compromise device or system security and
integrity.
Which is a little broader I suppose.
> > +The kernel remains the gatekeeper for this interface. If violations of the
> > +scopes, security or isolation principles are found, we have options to let
> > +devices fix them with a FW update, push a kernel patch to parse and block RPC
>
> fwctl does not do FW updates, is that correct?
I think it is up to the specific RPCs the device supports. Given there
is currently no way to marshal a large amount of data it is not a good
interface for FW update.
I'd encourage people to use devlink flash more broadly, but I also
wouldn't go out of the way to block FW update RPCs that might exist
from here.
I certainly wouldn't want people to make their own FW update ioctls
(as still seems to be happening) out of fear they shouldn't use
fwctl :\
Looking particularly at mlx5, we've had devlink flash for a long time
now, but it hasn't suceeded to displace the mlx5 specific tools, for
whatever reason.
I grabbed all the changes here thanks!
Jason
^ permalink raw reply [flat|nested] 48+ messages in thread* Re: [PATCH v2 6/8] fwctl: Add documentation
2024-07-22 16:18 ` Jason Gunthorpe
@ 2024-07-22 20:40 ` Randy Dunlap
0 siblings, 0 replies; 48+ messages in thread
From: Randy Dunlap @ 2024-07-22 20:40 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
Hi,
On 7/22/24 9:18 AM, Jason Gunthorpe wrote:
> On Tue, Jun 25, 2024 at 03:04:42PM -0700, Randy Dunlap wrote:
>>> +There are many things this interface must not allow user space to do (without a
>>> +Taint or CAP), broadly derived from the principles of kernel lockdown. Some
>>> +examples:
>>> +
>>> + 1. DMA to/from arbitrary memory, hang the system, run code in the device, or
>>
>> An RPC message is going to run code in the device. Should this say something instead
>> like:
>>
>> download [or load] code to be executed in the device,
>
> Yeah, it is a hard concept. It is kind of murky as even today's
> devlink flash will let you load untrusted code into the device under
> lockdown AFAICR.
>
> How about:
>
> 1. DMA to/from arbitrary memory, hang the system, compromise FW integrity with
> untrusted code, or otherwise compromise device or system security and
> integrity.
>
> Which is a little broader I suppose.
OK, somewhat better.
>>> +The kernel remains the gatekeeper for this interface. If violations of the
>>> +scopes, security or isolation principles are found, we have options to let
>>> +devices fix them with a FW update, push a kernel patch to parse and block RPC
>>
>> fwctl does not do FW updates, is that correct?
>
> I think it is up to the specific RPCs the device supports. Given there
> is currently no way to marshal a large amount of data it is not a good
> interface for FW update.
>
> I'd encourage people to use devlink flash more broadly, but I also
> wouldn't go out of the way to block FW update RPCs that might exist
> from here.
>
> I certainly wouldn't want people to make their own FW update ioctls
> (as still seems to be happening) out of fear they shouldn't use
> fwctl :\
fair enough.
> Looking particularly at mlx5, we've had devlink flash for a long time
> now, but it hasn't suceeded to displace the mlx5 specific tools, for
> whatever reason.
>
> I grabbed all the changes here thanks!
--
~Randy
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 6/8] fwctl: Add documentation
2024-06-24 22:47 ` [PATCH v2 6/8] fwctl: Add documentation Jason Gunthorpe
2024-06-25 22:04 ` Randy Dunlap
@ 2024-07-26 15:50 ` Jonathan Cameron
2024-07-29 16:11 ` Jason Gunthorpe
2024-08-06 8:03 ` Daniel Vetter
2 siblings, 1 reply; 48+ messages in thread
From: Jonathan Cameron @ 2024-07-26 15:50 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
> diff --git a/Documentation/userspace-api/fwctl.rst b/Documentation/userspace-api/fwctl.rst
> new file mode 100644
> index 00000000000000..ece2db2530502f
> --- /dev/null
> +++ b/Documentation/userspace-api/fwctl.rst
> @@ -0,0 +1,269 @@
> +Overview
> +========
> +
> +Modern devices contain extensive amounts of FW, and in many cases, are largely
FW and, in many cases, are
> +software-defined pieces of hardware. The evolution of this approach is largely a
> +reaction to Moore's Law where a chip tape out is now highly expensive, and the
> +chip design is extremely large. Replacing fixed HW logic with a flexible and
> +tightly coupled FW/HW combination is an effective risk mitigation against chip
> +respin. Problems in the HW design can be counteracted in device FW. This is
> +especially true for devices which present a stable and backwards compatible
> +interface to the operating system driver (such as NVMe).
...
The document lays out where this sits well.
Jonathan
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 6/8] fwctl: Add documentation
2024-07-26 15:50 ` Jonathan Cameron
@ 2024-07-29 16:11 ` Jason Gunthorpe
0 siblings, 0 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2024-07-29 16:11 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Fri, Jul 26, 2024 at 04:50:21PM +0100, Jonathan Cameron wrote:
>
> > diff --git a/Documentation/userspace-api/fwctl.rst b/Documentation/userspace-api/fwctl.rst
> > new file mode 100644
> > index 00000000000000..ece2db2530502f
> > --- /dev/null
> > +++ b/Documentation/userspace-api/fwctl.rst
> > @@ -0,0 +1,269 @@
>
> > +Overview
> > +========
> > +
> > +Modern devices contain extensive amounts of FW, and in many cases, are largely
>
> FW and, in many cases, are
Yep, Randy noted it too
> > +software-defined pieces of hardware. The evolution of this approach is largely a
> > +reaction to Moore's Law where a chip tape out is now highly expensive, and the
> > +chip design is extremely large. Replacing fixed HW logic with a flexible and
> > +tightly coupled FW/HW combination is an effective risk mitigation against chip
> > +respin. Problems in the HW design can be counteracted in device FW. This is
> > +especially true for devices which present a stable and backwards compatible
> > +interface to the operating system driver (such as NVMe).
>
> ...
>
> The document lays out where this sits well.
Thanks!
Jason
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 6/8] fwctl: Add documentation
2024-06-24 22:47 ` [PATCH v2 6/8] fwctl: Add documentation Jason Gunthorpe
2024-06-25 22:04 ` Randy Dunlap
2024-07-26 15:50 ` Jonathan Cameron
@ 2024-08-06 8:03 ` Daniel Vetter
2024-08-08 12:24 ` Jason Gunthorpe
2 siblings, 1 reply; 48+ messages in thread
From: Daniel Vetter @ 2024-08-06 8:03 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Mon, Jun 24, 2024 at 07:47:30PM -0300, Jason Gunthorpe wrote:
> Document the purpose and rules for the fwctl subsystem.
>
> Link in kdocs to the doc tree.
>
> Nacked-by: Jakub Kicinski <kuba@kernel.org>
> Link: https://lore.kernel.org/r/20240603114250.5325279c@kernel.org
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
I think we'll need something like fwctl rather sooner than later for gpus
too, so for all the fwctl patches up to this one:
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
I both really liked the approach to fwctl_unregister and
copy_struct_from_user, and didn't spot anything offensive in the code.
Bunch of optional thoughs below, my take at least from the drm point is
that it's better to get this going and learn as we do, than try to get
this perfect from the go. Some of the nuances will need to be informed by
practice anyway.
-Sima
> ---
> Documentation/userspace-api/fwctl.rst | 269 ++++++++++++++++++++++++++
> Documentation/userspace-api/index.rst | 1 +
> 2 files changed, 270 insertions(+)
> create mode 100644 Documentation/userspace-api/fwctl.rst
>
> diff --git a/Documentation/userspace-api/fwctl.rst b/Documentation/userspace-api/fwctl.rst
> new file mode 100644
> index 00000000000000..ece2db2530502f
> --- /dev/null
> +++ b/Documentation/userspace-api/fwctl.rst
> @@ -0,0 +1,269 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +===============
> +fwctl subsystem
> +===============
> +
> +:Author: Jason Gunthorpe
> +
> +Overview
> +========
> +
> +Modern devices contain extensive amounts of FW, and in many cases, are largely
> +software-defined pieces of hardware. The evolution of this approach is largely a
> +reaction to Moore's Law where a chip tape out is now highly expensive, and the
> +chip design is extremely large. Replacing fixed HW logic with a flexible and
> +tightly coupled FW/HW combination is an effective risk mitigation against chip
> +respin. Problems in the HW design can be counteracted in device FW. This is
> +especially true for devices which present a stable and backwards compatible
> +interface to the operating system driver (such as NVMe).
> +
> +The FW layer in devices has grown to incredible sizes and devices frequently
> +integrate clusters of fast processors to run it. For example, mlx5 devices have
> +over 30MB of FW code, and big configurations operate with over 1GB of FW managed
> +runtime state.
> +
> +The availability of such a flexible layer has created quite a variety in the
> +industry where single pieces of silicon are now configurable software-defined
> +devices and can operate in substantially different ways depending on the need.
> +Further, we often see cases where specific sites wish to operate devices in ways
> +that are highly specialized and require applications that have been tailored to
> +their unique configuration.
> +
> +Further, devices have become multi-functional and integrated to the point they
> +no longer fit neatly into the kernel's division of subsystems. Modern
> +multi-functional devices have drivers, such as bnxt/ice/mlx5/pds, that span many
> +subsystems while sharing the underlying hardware using the auxiliary device
> +system.
> +
> +All together this creates a challenge for the operating system, where devices
> +have an expansive FW environment that needs robust device-specific debugging
> +support, and FW-driven functionality that is not well suited to “generic”
> +interfaces. fwctl seeks to allow access to the full device functionality from
> +user space in the areas of debuggability, management, and first-boot/nth-boot
> +provisioning.
> +
> +fwctl is aimed at the common device design pattern where the OS and FW
> +communicate via an RPC message layer constructed with a queue or mailbox scheme.
> +In this case the driver will typically have some layer to deliver RPC messages
> +and collect RPC responses from device FW. The in-kernel subsystem drivers that
> +operate the device for its primary purposes will use these RPCs to build their
> +drivers, but devices also usually have a set of ancillary RPCs that don't really
> +fit into any specific subsystem. For example, a HW RAID controller is primarily
> +operated by the block layer but also comes with a set of RPCs to administer the
> +construction of drives within the HW RAID.
> +
> +In the past when devices were more single function, individual subsystems would
> +grow different approaches to solving some of these common problems. For instance
> +monitoring device health, manipulating its FLASH, debugging the FW,
> +provisioning, all have various unique interfaces across the kernel.
> +
> +fwctl's purpose is to define a common set of limited rules, described below,
> +that allow user space to securely construct and execute RPCs inside device FW.
> +The rules serve as an agreement between the operating system and FW on how to
> +correctly design the RPC interface. As a uAPI the subsystem provides a thin
> +layer of discovery and a generic uAPI to deliver the RPCs and collect the
> +response. It supports a system of user space libraries and tools which will
> +use this interface to control the device using the device native protocols.
> +
> +Scope of Action
> +---------------
> +
> +fwctl drivers are strictly restricted to being a way to operate the device FW.
> +It is not an avenue to access random kernel internals, or other operating system
> +SW states.
> +
> +fwctl instances must operate on a well-defined device function, and the device
> +should have a well-defined security model for what scope within the physical
> +device the function is permitted to access. For instance, the most complex PCIe
> +device today may broadly have several function-level scopes:
> +
> + 1. A privileged function with full access to the on-device global state and
> + configuration
> +
> + 2. Multiple hypervisor functions with control over itself and child functions
> + used with VMs
> +
> + 3. Multiple VM functions tightly scoped within the VM
> +
> +The device may create a logical parent/child relationship between these scopes.
> +For instance a child VM's FW may be within the scope of the hypervisor FW. It is
> +quite common in the VFIO world that the hypervisor environment has a complex
> +provisioning/profiling/configuration responsibility for the function VFIO
> +assigns to the VM.
> +
> +Further, within the function, devices often have RPC commands that fall within
> +some general scopes of action:
In your fwctl_rpc_scope you only have 4, not 5, and I think that's
clearer. Might also be good to put a kerneldoc link to the enum in here
for the details.
> + 1. Access to function & child configuration, FLASH, etc/ that becomes live at a
> + function reset.
> +
> + 2. Access to function & child runtime configuration that kernel drivers can
> + discover at runtime.
This one worries me, since it has potential for people to get it very
wrong and e.g. expose configuration where it's only safe if the driver
isn't bound, or at least no userspace is using it. I'd drop this and just
leave the one configuration rpc, with maybe more detail what exactly "When
configuration is written to the device remains in a fully supported
state." from the kerneldoc means. I think only safe options woulb be a)
applied on function reset b) transparent to both kernel and userspace
(beyond maybe device performance).
That might not cut all configuration items, but for those I think it'd be
best if fwctl guarantees through the driver model that there's no driver
bound to that function (or used by vfio/kvm), to guarantee safety. So
explicitly split them out as runtime configuration with a distinct rpc
scope. Maybe an addition for later.
> +
> + 3. Read only access to function debug information that may report on FW objects
> + in the function & child, including FW objects owned by other kernel
> + subsystems.
> +
> + 4. Write access to function & child debug information strictly compatible with
> + the principles of kernel lockdown and kernel integrity protection. Triggers
> + a kernel Taint.
> +
> + 5. Full debug device access. Triggers a kernel Taint, requires CAP_SYS_RAWIO.
> +
> +Userspace will provide a scope label on each RPC and the kernel must enforce the
> +above CAP's and taints based on that scope. A combination of kernel and FW can
> +enforce that RPCs are placed in the correct scope by userspace.
> +
> +Denied behavior
> +---------------
> +
> +There are many things this interface must not allow user space to do (without a
> +Taint or CAP), broadly derived from the principles of kernel lockdown. Some
> +examples:
> +
> + 1. DMA to/from arbitrary memory, hang the system, run code in the device, or
> + otherwise compromise device or system security and integrity.
> +
> + 2. Provide an abnormal “back door” to kernel drivers. No manipulation of kernel
> + objects owned by kernel drivers.
> +
> + 3. Directly configure or otherwise control kernel drivers. A subsystem kernel
> + driver can react to the device configuration at function reset/driver load
> + time, but otherwise should not be coupled to fwctl.
Kinda the same worry as above, I think this should be "... but otherwise
_must_ not be coupled to fwctl".
> + 4. Operate the HW in a way that overlaps with the core purpose of another
> + primary kernel subsystem, such as read/write to LBAs, send/receive of
> + network packets, or operate an accelerator's data plane.
I still think some words about what to do when fwctl exposes some
functional which later on is covered by a newly added subsystem that
didn't yet exist. Also maybe adding some more examples from the RAS side
of things, since that's come up a few time in the ksummit-discuss thread,
plus I think it's where we'll most likely have a new subsystem or extended
functionality of an existing one pop up and cause conflicts with fwctl
rpcs that have landed earlier.
I'm personally fine with "we'll figure that out when it happens."
> +
> +fwctl is not a replacement for device direct access subsystems like uacce or
> +VFIO.
> +
> +fwctl User API
> +==============
> +
> +.. kernel-doc:: include/uapi/fwctl/fwctl.h
> +.. kernel-doc:: include/uapi/fwctl/mlx5.h
> +
> +sysfs Class
> +-----------
> +
> +fwctl has a sysfs class (/sys/class/fwctl/fwctlNN/) and character devices
> +(/dev/fwctl/fwctlNN) with a simple numbered scheme. The character device
> +operates the iotcl uAPI described above.
> +
> +fwctl devices can be related to driver components in other subsystems through
> +sysfs::
> +
> + $ ls /sys/class/fwctl/fwctl0/device/infiniband/
> + ibp0s10f0
> +
> + $ ls /sys/class/infiniband/ibp0s10f0/device/fwctl/
> + fwctl0/
> +
> + $ ls /sys/devices/pci0000:00/0000:00:0a.0/fwctl/fwctl0
> + dev device power subsystem uevent
> +
> +User space Community
> +--------------------
> +
> +Drawing inspiration from nvme-cli, participating in the kernel side must come
> +with a user space in a common TBD git tree, at a minimum to usefully operate the
> +kernel driver. Providing such an implementation is a pre-condition to merging a
> +kernel driver.
> +
> +The goal is to build user space community around some of the shared problems
> +we all have, and ideally develop some common user space programs with some
> +starting themes of:
> +
> + - Device in-field debugging
> +
> + - HW provisioning
> +
> + - VFIO child device profiling before VM boot
> +
> + - Confidential Compute topics (attestation, secure provisioning)
> +
> +That stretch across all subsystems in the kernel. fwupd is a great example of
> +how an excellent user space experience can emerge out of kernel-side diversity.
> +
> +fwctl Kernel API
> +================
> +
> +.. kernel-doc:: drivers/fwctl/main.c
> + :export:
> +.. kernel-doc:: include/linux/fwctl.h
> +
> +fwctl Driver design
> +-------------------
> +
> +In many cases a fwctl driver is going to be part of a larger cross-subsystem
> +device possibly using the auxiliary_device mechanism. In that case several
> +subsystems are going to be sharing the same device and FW interface layer so the
> +device design must already provide for isolation and cooperation between kernel
> +subsystems. fwctl should fit into that same model.
> +
> +Part of the driver should include a description of how its scope restrictions
> +and security model work. The driver and FW together must ensure that RPCs
> +provided by user space are mapped to the appropriate scope. If the validation is
> +done in the driver then the validation can read a 'command effects' report from
> +the device, or hardwire the enforcement. If the validation is done in the FW,
> +then the driver should pass the fwctl_rpc_scope to the FW along with the command.
> +
> +The driver and FW must cooperate to ensure that either fwctl cannot allocate
> +any FW resources, or any resources it does allocate are freed on FD closure. A
> +driver primarily constructed around FW RPCs may find that its core PCI function
> +and RPC layer belongs under fwctl with auxiliary devices connecting to other
> +subsystems.
> +
> +Each device type must represent a stable FW ABI, such that the userspace
> +components have the same general stability we expect from the kernel. FW upgrade
> +should not break the userspace tools.
I think an exception for the debug rpcs (or maybe only
FWCTL_DEBUG_WRITE_FULL if we're super strict) could really help the case
for fwctl. Still not allowing to break individual rpcs, but maybe allow fw
to remove outdated ones. With gpu fw we already struggle with abi
breakages where the kernel driver can do some amount of impendance
mismatch. If this is extended to debug tooling I fear it just wont happen,
forcing big junks of what fwctl could support to stay out of tree.
And especially for field and even more in-house debug tooling, you really
want the userspace version matching your fw anyway.
Currently that mess tends to leave in debugfs and/or out-of-tree, so
there's no stable uapi guarantee anyway. And I don't see the point in
requiring it - if there is a need for stabling tooling, maybe that
indicates more the need for a new subsystem that standardized things
across devices/vendors.
Another case in point are tracepoints, where the stable abi question also
has a lot more nuanced answer. Debug fw support imo falls into that same
bucket.
> +
> +Security Response
> +=================
> +
> +The kernel remains the gatekeeper for this interface. If violations of the
> +scopes, security or isolation principles are found, we have options to let
> +devices fix them with a FW update, push a kernel patch to parse and block RPC
> +commands or push a kernel patch to block entire firmware versions/devices.
> +
> +While the kernel can always directly parse and restrict RPCs, it is expected
> +that the existing kernel pattern of allowing drivers to delegate validation to
> +FW to be a useful design.
> +
> +Existing Similar Examples
> +=========================
> +
> +The approach described in this document is not a new idea. Direct, or near
> +direct device access has been offered by the kernel in different areas for
> +decades. With more devices wanting to follow this design pattern it is becoming
> +clear that it is not entirely well understood and, more importantly, the
> +security considerations are not well defined or agreed upon.
> +
> +Some examples:
> +
> + - HW RAID controllers. This includes RPCs to do things like compose drives into
> + a RAID volume, configure RAID parameters, monitor the HW and more.
> +
> + - Baseboard managers. RPCs for configuring settings in the device and more
> +
> + - NVMe vendor command capsules. nvme-cli provides access to some monitoring
> + functions that different products have defined, but more exists.
> +
> + - CXL also has a NVMe-like vendor command system.
> +
> + - DRM allows user space drivers to send commands to the device via kernel
> + mediation
> +
> + - RDMA allows user space drivers to directly push commands to the device
> + without kernel involvement
> +
> + - Various “raw” APIs, raw HID (SDL2), raw USB, NVMe Generic Interface, etc
> +
> +The first 4 are examples of areas that fwctl intends to cover.
Maybe add a sentence here why the latter 3 aren't, just to strengthen that
point?
> +
> +Some key lessons learned from these past efforts are the importance of having a
> +common user space project to use as a pre-condition for obtaining a kernel
> +driver. Developing good community around useful software in user space is key to
> +getting companies to fund participation to enable their products.
> diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
> index 8a251d71fa6e14..990b4c0710c99e 100644
> --- a/Documentation/userspace-api/index.rst
> +++ b/Documentation/userspace-api/index.rst
> @@ -44,6 +44,7 @@ Devices and I/O
>
> accelerators/ocxl
> dma-buf-alloc-exchange
> + fwctl
> gpio/index
> iommu
> iommufd
> --
> 2.45.2
>
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [PATCH v2 6/8] fwctl: Add documentation
2024-08-06 8:03 ` Daniel Vetter
@ 2024-08-08 12:24 ` Jason Gunthorpe
2024-08-09 9:21 ` Daniel Vetter
0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2024-08-08 12:24 UTC (permalink / raw)
To: Daniel Vetter
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Tue, Aug 06, 2024 at 10:03:36AM +0200, Daniel Vetter wrote:
> On Mon, Jun 24, 2024 at 07:47:30PM -0300, Jason Gunthorpe wrote:
> > Document the purpose and rules for the fwctl subsystem.
> >
> > Link in kdocs to the doc tree.
> >
> > Nacked-by: Jakub Kicinski <kuba@kernel.org>
> > Link: https://lore.kernel.org/r/20240603114250.5325279c@kernel.org
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
>
> I think we'll need something like fwctl rather sooner than later for gpus
> too, so for all the fwctl patches up to this one:
Yes, I think so as well. You can see it already in the various GPU out
of tree stuff where there is usually some expansive
monitoring/debug/provisioning tool there too.
> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Thanks!
> I both really liked the approach to fwctl_unregister and
> copy_struct_from_user, and didn't spot anything offensive in the code.
It is copied from iommufd which copied concepts from the fixedup modern
rdma :) Proven over a few years now.
> > +Further, within the function, devices often have RPC commands that fall within
> > +some general scopes of action:
>
> In your fwctl_rpc_scope you only have 4, not 5, and I think that's
> clearer. Might also be good to put a kerneldoc link to the enum in here
> for the details.
I bundled these two together in the enum FWCTL_RPC_CONFIGURATION:
> > + 1. Access to function & child configuration, FLASH, etc/ that becomes live at a
> > + function reset.
> > +
> > + 2. Access to function & child runtime configuration that kernel drivers can
> > + discover at runtime.
>
> This one worries me, since it has potential for people to get it very
> wrong and e.g. expose configuration where it's only safe if the driver
> isn't bound, or at least no userspace is using it.
The notion was "at runtime" meaning any active user of the device
would either not be aware of whatever change or already have some way
to learn about it.
Especially when we think about child configuration - ie configuration
of a VF from the hypervisor while a VM is running - there can be
useful things that fit under this category. For instance you might
throttle the device to support live migration.
Throttling is a really complex topic for an autonomous device like
GPU/RDMA.
> I'd drop this and just
> leave the one configuration rpc, with maybe more detail what exactly "When
> configuration is written to the device remains in a fully supported
> state."
Ah, this language is incorporating the distro feedback around loosing
the ability to support the system.
> from the kerneldoc means. I think only safe options woulb be a)
> applied on function reset b) transparent to both kernel and userspace
> (beyond maybe device performance).
Let's lean into transparent more:
1. Access to function & child configuration, FLASH, etc. that becomes live at a
function reset. Access to function & child runtime configuration that is
transparent or non-disruptive to any driver or VM.
> That might not cut all configuration items, but for those I think it'd be
> best if fwctl guarantees through the driver model that there's no driver
> bound to that function (or used by vfio/kvm), to guarantee safety. So
> explicitly split them out as runtime configuration with a distinct rpc
> scope. Maybe an addition for later.
No driver bound is too strong. VFIO and even mlx5_core can trigger a
FLR while still bound in a controlled way. Taking effect at FLR time
is a reasonable restriction.
Like if you reconfigure a child VF and then start a VM there will be
an automatic FLR in that process that can make the updated VF
configuration active.
> > + 3. Directly configure or otherwise control kernel drivers. A subsystem kernel
> > + driver can react to the device configuration at function reset/driver load
> > + time, but otherwise should not be coupled to fwctl.
>
> Kinda the same worry as above, I think this should be "... but otherwise
> _must_ not be coupled to fwctl".
Yep
> > + 4. Operate the HW in a way that overlaps with the core purpose of another
> > + primary kernel subsystem, such as read/write to LBAs, send/receive of
> > + network packets, or operate an accelerator's data plane.
>
> I still think some words about what to do when fwctl exposes some
> functional which later on is covered by a newly added subsystem that
> didn't yet exist. Also maybe adding some more examples from the RAS side
> of things, since that's come up a few time in the ksummit-discuss thread,
> plus I think it's where we'll most likely have a new subsystem or extended
> functionality of an existing one pop up and cause conflicts with fwctl
> rpcs that have landed earlier.
How about:
Operations exposed through fwctl's non-taining interfaces should be fully
sharable with other users of the device. For instance exposing a RPC through
fwctl should never prevent a kernel subsystem from also concurrently using that
same RPC or hardware unit down the road. In such cases fwctl will be less
important than proper kernel subsystems that eventually emerge. Mistakes in this
area resulting in clashes will be resolved in favour of a kernel implementation.
> > +fwctl Driver design
> > +-------------------
> > +
> > +In many cases a fwctl driver is going to be part of a larger cross-subsystem
> > +device possibly using the auxiliary_device mechanism. In that case several
> > +subsystems are going to be sharing the same device and FW interface layer so the
> > +device design must already provide for isolation and cooperation between kernel
> > +subsystems. fwctl should fit into that same model.
> > +
> > +Part of the driver should include a description of how its scope restrictions
> > +and security model work. The driver and FW together must ensure that RPCs
> > +provided by user space are mapped to the appropriate scope. If the validation is
> > +done in the driver then the validation can read a 'command effects' report from
> > +the device, or hardwire the enforcement. If the validation is done in the FW,
> > +then the driver should pass the fwctl_rpc_scope to the FW along with the command.
> > +
> > +The driver and FW must cooperate to ensure that either fwctl cannot allocate
> > +any FW resources, or any resources it does allocate are freed on FD closure. A
> > +driver primarily constructed around FW RPCs may find that its core PCI function
> > +and RPC layer belongs under fwctl with auxiliary devices connecting to other
> > +subsystems.
> > +
> > +Each device type must represent a stable FW ABI, such that the userspace
> > +components have the same general stability we expect from the kernel. FW upgrade
> > +should not break the userspace tools.
>
> I think an exception for the debug rpcs (or maybe only
> FWCTL_DEBUG_WRITE_FULL if we're super strict) could really help the case
> for fwctl. Still not allowing to break individual rpcs, but maybe allow fw
> to remove outdated ones.
I'm definitely mindful of Linus's pragmatic view of ABI stability, where
existing tools that *people actually use* shouldn't break. You
shouldn't have to upgrade your tools when you upgrade your FW.
I think it is important to convey that as a the gold standard here
too.
> And especially for field and even more in-house debug tooling, you really
> want the userspace version matching your fw anyway.
Yes, this is definately the case. But those tools are also private and
I think fall under the *people actually use* exception.
So, lets try again:
Each device type must be mindful of Linux's philosophy for stable ABI. The FW
RPC interface does not have to meet a strictly stable ABI, but it does need to
meet an expectation that userspace tools that are deployed and in significant
use don't needlessly break. FW upgrade and kernel upgrade should keep widely
deployed tooling working.
Development and debugging focused RPCs under more permissive scopes can have
less stablitiy if the tools using them are only run under exceptional
circumstances and not for every day use of the device. Debugging tools may even
require exact version matching as they may require something similar to DWARF
debug information from the FW binary.
> Currently that mess tends to leave in debugfs and/or out-of-tree, so
> there's no stable uapi guarantee anyway. And I don't see the point in
> requiring it - if there is a need for stabling tooling, maybe that
> indicates more the need for a new subsystem that standardized things
> across devices/vendors.
Yes, fwctl needs to have both. I would expect things like the
configuration to have a fairly stable ABI. Maybe the list of
configurables will change but access to them should be ABI stable.
> > +Some examples:
> > +
> > + - HW RAID controllers. This includes RPCs to do things like compose drives into
> > + a RAID volume, configure RAID parameters, monitor the HW and more.
> > +
> > + - Baseboard managers. RPCs for configuring settings in the device and more
> > +
> > + - NVMe vendor command capsules. nvme-cli provides access to some monitoring
> > + functions that different products have defined, but more exists.
> > +
> > + - CXL also has a NVMe-like vendor command system.
> > +
> > + - DRM allows user space drivers to send commands to the device via kernel
> > + mediation
> > +
> > + - RDMA allows user space drivers to directly push commands to the device
> > + without kernel involvement
> > +
> > + - Various “raw” APIs, raw HID (SDL2), raw USB, NVMe Generic Interface, etc
> > +
> > +The first 4 are examples of areas that fwctl intends to cover.
>
> Maybe add a sentence here why the latter 3 aren't, just to strengthen that
> point?
How about:
The first 4 are examples of areas that fwctl intends to cover. The latter three
are examples of denied behavior as they fully overlap with the primary purpose
of a kernel subsystem.
Thanks
Jason
^ permalink raw reply [flat|nested] 48+ messages in thread* Re: [PATCH v2 6/8] fwctl: Add documentation
2024-08-08 12:24 ` Jason Gunthorpe
@ 2024-08-09 9:21 ` Daniel Vetter
0 siblings, 0 replies; 48+ messages in thread
From: Daniel Vetter @ 2024-08-09 9:21 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Daniel Vetter, Jonathan Corbet, Itay Avraham, Jakub Kicinski,
Leon Romanovsky, linux-doc, linux-rdma, netdev, Paolo Abeni,
Saeed Mahameed, Tariq Toukan, Andy Gospodarek, Aron Silverton,
Dan Williams, David Ahern, Christoph Hellwig, Jiri Pirko,
Leonid Bloch, Leon Romanovsky, linux-cxl, patches
On Thu, Aug 08, 2024 at 09:24:13AM -0300, Jason Gunthorpe wrote:
> On Tue, Aug 06, 2024 at 10:03:36AM +0200, Daniel Vetter wrote:
> > On Mon, Jun 24, 2024 at 07:47:30PM -0300, Jason Gunthorpe wrote:
> > > Document the purpose and rules for the fwctl subsystem.
> > >
> > > Link in kdocs to the doc tree.
> > >
> > > Nacked-by: Jakub Kicinski <kuba@kernel.org>
> > > Link: https://lore.kernel.org/r/20240603114250.5325279c@kernel.org
> > > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> >
> > I think we'll need something like fwctl rather sooner than later for gpus
> > too, so for all the fwctl patches up to this one:
>
> Yes, I think so as well. You can see it already in the various GPU out
> of tree stuff where there is usually some expansive
> monitoring/debug/provisioning tool there too.
>
> > Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>
> Thanks!
>
> > I both really liked the approach to fwctl_unregister and
> > copy_struct_from_user, and didn't spot anything offensive in the code.
>
> It is copied from iommufd which copied concepts from the fixedup modern
> rdma :) Proven over a few years now.
>
> > > +Further, within the function, devices often have RPC commands that fall within
> > > +some general scopes of action:
> >
> > In your fwctl_rpc_scope you only have 4, not 5, and I think that's
> > clearer. Might also be good to put a kerneldoc link to the enum in here
> > for the details.
>
> I bundled these two together in the enum FWCTL_RPC_CONFIGURATION:
Yeah I figured, but if you want to keep the split it might help that the
kerneldoc for FWCTL_RPC_CONFIGURATION mentions that it includes both
delayed configuration and runtimme configuration from the overview doc. Or
maybe group them here into 1a and 1b.
Also this one is an extremely minor bikeshed, feel free to ignore :-)
> > > + 1. Access to function & child configuration, FLASH, etc/ that becomes live at a
> > > + function reset.
> > > +
> > > + 2. Access to function & child runtime configuration that kernel drivers can
> > > + discover at runtime.
> >
> > This one worries me, since it has potential for people to get it very
> > wrong and e.g. expose configuration where it's only safe if the driver
> > isn't bound, or at least no userspace is using it.
>
> The notion was "at runtime" meaning any active user of the device
> would either not be aware of whatever change or already have some way
> to learn about it.
>
> Especially when we think about child configuration - ie configuration
> of a VF from the hypervisor while a VM is running - there can be
> useful things that fit under this category. For instance you might
> throttle the device to support live migration.
>
> Throttling is a really complex topic for an autonomous device like
> GPU/RDMA.
Yeah these kind of things are imo fine. But when I read your description
of "can discover at runtime" I'm more thinking stuff like number of
compute cores, or channels for communication or whatever. Which you can
discover, but if you don't you discover it by failing because the thing
you thought was there is now suddenly gone.
Ofc the guest can discover throttling by noticing that its gpu suddenly
got a bit (or a lot) slower, but that's a different kind of discover imo.
> > I'd drop this and just
> > leave the one configuration rpc, with maybe more detail what exactly "When
> > configuration is written to the device remains in a fully supported
> > state."
>
> Ah, this language is incorporating the distro feedback around loosing
> the ability to support the system.
Maybe I'm just dense, but for me it'd be good to differentiate between
runtime changes like throttling, which generally shouldn't upset
drivers/guests. And changes which can if they don't go actively discover
them, and making it really clear the latter must be delayed until next
reset. Currently I read what you have as including the latter as allowed
without requiring reset, as long as the driver/guest _can_ discover the
change somehow.
So maybe something like this?
2. Access to function & child runtime configuration that are either
transparent to users of that function, or which can be discovered without
disruption (like when the fw/hw function interface makes provisions for
such runtime configuration changes in the programming model).
> > from the kerneldoc means. I think only safe options woulb be a)
> > applied on function reset b) transparent to both kernel and userspace
> > (beyond maybe device performance).
>
> Let's lean into transparent more:
>
> 1. Access to function & child configuration, FLASH, etc. that becomes live at a
> function reset. Access to function & child runtime configuration that is
> transparent or non-disruptive to any driver or VM.
Ack, sounds really clear to me now.
>
> > That might not cut all configuration items, but for those I think it'd be
> > best if fwctl guarantees through the driver model that there's no driver
> > bound to that function (or used by vfio/kvm), to guarantee safety. So
> > explicitly split them out as runtime configuration with a distinct rpc
> > scope. Maybe an addition for later.
>
> No driver bound is too strong. VFIO and even mlx5_core can trigger a
> FLR while still bound in a controlled way. Taking effect at FLR time
> is a reasonable restriction.
>
> Like if you reconfigure a child VF and then start a VM there will be
> an automatic FLR in that process that can make the updated VF
> configuration active.
Ah yeah if the driver copes/initiates, then it's fine too. Maybe a case of
both hw reset and "no driver bound", but also a case of "details we can
figure out later".
> > > + 3. Directly configure or otherwise control kernel drivers. A subsystem kernel
> > > + driver can react to the device configuration at function reset/driver load
> > > + time, but otherwise should not be coupled to fwctl.
> >
> > Kinda the same worry as above, I think this should be "... but otherwise
> > _must_ not be coupled to fwctl".
>
> Yep
>
> > > + 4. Operate the HW in a way that overlaps with the core purpose of another
> > > + primary kernel subsystem, such as read/write to LBAs, send/receive of
> > > + network packets, or operate an accelerator's data plane.
> >
> > I still think some words about what to do when fwctl exposes some
> > functional which later on is covered by a newly added subsystem that
> > didn't yet exist. Also maybe adding some more examples from the RAS side
> > of things, since that's come up a few time in the ksummit-discuss thread,
> > plus I think it's where we'll most likely have a new subsystem or extended
> > functionality of an existing one pop up and cause conflicts with fwctl
> > rpcs that have landed earlier.
>
> How about:
>
> Operations exposed through fwctl's non-taining interfaces should be fully
> sharable with other users of the device. For instance exposing a RPC through
> fwctl should never prevent a kernel subsystem from also concurrently using that
> same RPC or hardware unit down the road. In such cases fwctl will be less
> important than proper kernel subsystems that eventually emerge. Mistakes in this
> area resulting in clashes will be resolved in favour of a kernel implementation.
Ack.
>
> > > +fwctl Driver design
> > > +-------------------
> > > +
> > > +In many cases a fwctl driver is going to be part of a larger cross-subsystem
> > > +device possibly using the auxiliary_device mechanism. In that case several
> > > +subsystems are going to be sharing the same device and FW interface layer so the
> > > +device design must already provide for isolation and cooperation between kernel
> > > +subsystems. fwctl should fit into that same model.
> > > +
> > > +Part of the driver should include a description of how its scope restrictions
> > > +and security model work. The driver and FW together must ensure that RPCs
> > > +provided by user space are mapped to the appropriate scope. If the validation is
> > > +done in the driver then the validation can read a 'command effects' report from
> > > +the device, or hardwire the enforcement. If the validation is done in the FW,
> > > +then the driver should pass the fwctl_rpc_scope to the FW along with the command.
> > > +
> > > +The driver and FW must cooperate to ensure that either fwctl cannot allocate
> > > +any FW resources, or any resources it does allocate are freed on FD closure. A
> > > +driver primarily constructed around FW RPCs may find that its core PCI function
> > > +and RPC layer belongs under fwctl with auxiliary devices connecting to other
> > > +subsystems.
> > > +
> > > +Each device type must represent a stable FW ABI, such that the userspace
> > > +components have the same general stability we expect from the kernel. FW upgrade
> > > +should not break the userspace tools.
> >
> > I think an exception for the debug rpcs (or maybe only
> > FWCTL_DEBUG_WRITE_FULL if we're super strict) could really help the case
> > for fwctl. Still not allowing to break individual rpcs, but maybe allow fw
> > to remove outdated ones.
>
> I'm definitely mindful of Linus's pragmatic view of ABI stability, where
> existing tools that *people actually use* shouldn't break. You
> shouldn't have to upgrade your tools when you upgrade your FW.
>
> I think it is important to convey that as a the gold standard here
> too.
>
> > And especially for field and even more in-house debug tooling, you really
> > want the userspace version matching your fw anyway.
>
> Yes, this is definately the case. But those tools are also private and
> I think fall under the *people actually use* exception.
>
> So, lets try again:
>
> Each device type must be mindful of Linux's philosophy for stable ABI. The FW
> RPC interface does not have to meet a strictly stable ABI, but it does need to
> meet an expectation that userspace tools that are deployed and in significant
> use don't needlessly break. FW upgrade and kernel upgrade should keep widely
> deployed tooling working.
>
> Development and debugging focused RPCs under more permissive scopes can have
> less stablitiy if the tools using them are only run under exceptional
> circumstances and not for every day use of the device. Debugging tools may even
> require exact version matching as they may require something similar to DWARF
> debug information from the FW binary.
Perfect imo, ack.
> > Currently that mess tends to leave in debugfs and/or out-of-tree, so
> > there's no stable uapi guarantee anyway. And I don't see the point in
> > requiring it - if there is a need for stabling tooling, maybe that
> > indicates more the need for a new subsystem that standardized things
> > across devices/vendors.
>
> Yes, fwctl needs to have both. I would expect things like the
> configuration to have a fairly stable ABI. Maybe the list of
> configurables will change but access to them should be ABI stable.
>
> > > +Some examples:
> > > +
> > > + - HW RAID controllers. This includes RPCs to do things like compose drives into
> > > + a RAID volume, configure RAID parameters, monitor the HW and more.
> > > +
> > > + - Baseboard managers. RPCs for configuring settings in the device and more
> > > +
> > > + - NVMe vendor command capsules. nvme-cli provides access to some monitoring
> > > + functions that different products have defined, but more exists.
> > > +
> > > + - CXL also has a NVMe-like vendor command system.
> > > +
> > > + - DRM allows user space drivers to send commands to the device via kernel
> > > + mediation
> > > +
> > > + - RDMA allows user space drivers to directly push commands to the device
> > > + without kernel involvement
> > > +
> > > + - Various “raw” APIs, raw HID (SDL2), raw USB, NVMe Generic Interface, etc
> > > +
> > > +The first 4 are examples of areas that fwctl intends to cover.
> >
> > Maybe add a sentence here why the latter 3 aren't, just to strengthen that
> > point?
>
> How about:
>
> The first 4 are examples of areas that fwctl intends to cover. The latter three
> are examples of denied behavior as they fully overlap with the primary purpose
> of a kernel subsystem.
Ack.
Cheers, Sima
--
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH v2 7/8] fwctl/mlx5: Support for communicating with mlx5 fw
2024-06-24 22:47 [PATCH v2 0/8] Introduce fwctl subystem Jason Gunthorpe
` (5 preceding siblings ...)
2024-06-24 22:47 ` [PATCH v2 6/8] fwctl: Add documentation Jason Gunthorpe
@ 2024-06-24 22:47 ` Jason Gunthorpe
2024-07-26 16:10 ` Jonathan Cameron
2024-06-24 22:47 ` [PATCH v2 8/8] mlx5: Create an auxiliary device for fwctl_mlx5 Jason Gunthorpe
2024-06-24 23:18 ` [PATCH v2 0/8] Introduce fwctl subystem Jakub Kicinski
8 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2024-06-24 22:47 UTC (permalink / raw)
To: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan
Cc: Andy Gospodarek, Aron Silverton, Dan Williams, David Ahern,
Christoph Hellwig, Jiri Pirko, Leonid Bloch, Leon Romanovsky,
linux-cxl, patches
From: Saeed Mahameed <saeedm@nvidia.com>
mlx5's fw has long provided a User Context concept. This has a long
history in RDMA as part of the devx extended verbs programming
interface. A User Context is a security envelope that contains objects and
controls access. It contains the Protection Domain object from the
InfiniBand Architecture and both togther provide the OS with the necessary
tools to bind a security context like a process to the device.
The security context is restricted to not be able to touch the kernel or
other processes. In the RDMA verbs case it is also restricted to not touch
global device resources.
The fwctl_mlx5 takes this approach and builds a User Context per fwctl
file descriptor and uses a FW security capability on the User Context to
enable access to global device resources. This makes the context useful
for provisioning and debugging the global device state.
mlx5 already has a robust infrastructure for delivering RPC messages to
fw. Trivially connect fwctl's RPC mechanism to mlx5_cmd_do(). Enforce the
User Context ID in every RPC header so the FW knows the security context
of the issuing ID.
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
MAINTAINERS | 7 +
drivers/fwctl/Kconfig | 14 ++
drivers/fwctl/Makefile | 1 +
drivers/fwctl/mlx5/Makefile | 4 +
drivers/fwctl/mlx5/main.c | 337 ++++++++++++++++++++++++++++++++++++
include/uapi/fwctl/fwctl.h | 1 +
include/uapi/fwctl/mlx5.h | 36 ++++
7 files changed, 400 insertions(+)
create mode 100644 drivers/fwctl/mlx5/Makefile
create mode 100644 drivers/fwctl/mlx5/main.c
create mode 100644 include/uapi/fwctl/mlx5.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 2090009a6ae98a..f0689e510510b3 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9085,6 +9085,13 @@ F: drivers/fwctl/
F: include/linux/fwctl.h
F: include/uapi/fwctl/
+FWCTL MLX5 DRIVER
+M: Saeed Mahameed <saeedm@nvidia.com>
+R: Itay Avraham <itayavr@nvidia.com>
+L: linux-kernel@vger.kernel.org
+S: Maintained
+F: drivers/fwctl/mlx5/
+
GALAXYCORE GC0308 CAMERA SENSOR DRIVER
M: Sebastian Reichel <sre@kernel.org>
L: linux-media@vger.kernel.org
diff --git a/drivers/fwctl/Kconfig b/drivers/fwctl/Kconfig
index 37147a695add9a..e5ee2d46d43126 100644
--- a/drivers/fwctl/Kconfig
+++ b/drivers/fwctl/Kconfig
@@ -7,3 +7,17 @@ menuconfig FWCTL
support a wide range of lockdown compatible device behaviors including
manipulating device FLASH, debugging, and other activities that don't
fit neatly into an existing subsystem.
+
+if FWCTL
+config FWCTL_MLX5
+ tristate "mlx5 ConnectX control fwctl driver"
+ depends on MLX5_CORE
+ help
+ MLX5CTL provides interface for the user process to access the debug and
+ configuration registers of the ConnectX hardware family
+ (NICs, PCI switches and SmartNIC SoCs).
+ This will allow configuration and debug tools to work out of the box on
+ mainstream kernel.
+
+ If you don't know what to do here, say N.
+endif
diff --git a/drivers/fwctl/Makefile b/drivers/fwctl/Makefile
index 1cad210f6ba580..1c535f694d7fe4 100644
--- a/drivers/fwctl/Makefile
+++ b/drivers/fwctl/Makefile
@@ -1,4 +1,5 @@
# SPDX-License-Identifier: GPL-2.0
obj-$(CONFIG_FWCTL) += fwctl.o
+obj-$(CONFIG_FWCTL_MLX5) += mlx5/
fwctl-y += main.o
diff --git a/drivers/fwctl/mlx5/Makefile b/drivers/fwctl/mlx5/Makefile
new file mode 100644
index 00000000000000..139a23e3c7c517
--- /dev/null
+++ b/drivers/fwctl/mlx5/Makefile
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-$(CONFIG_FWCTL_MLX5) += mlx5_fwctl.o
+
+mlx5_fwctl-y += main.o
diff --git a/drivers/fwctl/mlx5/main.c b/drivers/fwctl/mlx5/main.c
new file mode 100644
index 00000000000000..5e64371d7e5508
--- /dev/null
+++ b/drivers/fwctl/mlx5/main.c
@@ -0,0 +1,337 @@
+// SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0
+/*
+ * Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES
+ */
+#include <linux/fwctl.h>
+#include <linux/auxiliary_bus.h>
+#include <linux/mlx5/device.h>
+#include <linux/mlx5/driver.h>
+#include <uapi/fwctl/mlx5.h>
+
+#define mlx5ctl_err(mcdev, format, ...) \
+ dev_err(&mcdev->fwctl.dev, format, ##__VA_ARGS__)
+
+#define mlx5ctl_dbg(mcdev, format, ...) \
+ dev_dbg(&mcdev->fwctl.dev, "PID %u: " format, current->pid, \
+ ##__VA_ARGS__)
+
+struct mlx5ctl_uctx {
+ struct fwctl_uctx uctx;
+ u32 uctx_caps;
+ u32 uctx_uid;
+};
+
+struct mlx5ctl_dev {
+ struct fwctl_device fwctl;
+ struct mlx5_core_dev *mdev;
+};
+DEFINE_FREE(mlx5ctl, struct mlx5ctl_dev *, if (_T) fwctl_put(&_T->fwctl));
+
+struct mlx5_ifc_mbox_in_hdr_bits {
+ u8 opcode[0x10];
+ u8 uid[0x10];
+
+ u8 reserved_at_20[0x10];
+ u8 op_mod[0x10];
+
+ u8 reserved_at_40[0x40];
+};
+
+struct mlx5_ifc_mbox_out_hdr_bits {
+ u8 status[0x8];
+ u8 reserved_at_8[0x18];
+
+ u8 syndrome[0x20];
+
+ u8 reserved_at_40[0x40];
+};
+
+enum {
+ MLX5_UCTX_OBJECT_CAP_TOOLS_RESOURCES = 0x4,
+};
+
+enum {
+ MLX5_CMD_OP_QUERY_DIAGNOSTIC_PARAMS = 0x819,
+ MLX5_CMD_OP_SET_DIAGNOSTIC_PARAMS = 0x820,
+ MLX5_CMD_OP_QUERY_DIAGNOSTIC_COUNTERS = 0x821,
+ MLX5_CMD_OP_POSTPONE_CONNECTED_QP_TIMEOUT = 0xb2e,
+};
+
+static int mlx5ctl_alloc_uid(struct mlx5ctl_dev *mcdev, u32 cap)
+{
+ u32 out[MLX5_ST_SZ_DW(create_uctx_out)] = {};
+ u32 in[MLX5_ST_SZ_DW(create_uctx_in)] = {};
+ void *uctx;
+ int ret;
+ u16 uid;
+
+ uctx = MLX5_ADDR_OF(create_uctx_in, in, uctx);
+
+ mlx5ctl_dbg(mcdev, "%s: caps 0x%x\n", __func__, cap);
+ MLX5_SET(create_uctx_in, in, opcode, MLX5_CMD_OP_CREATE_UCTX);
+ MLX5_SET(uctx, uctx, cap, cap);
+
+ ret = mlx5_cmd_exec(mcdev->mdev, in, sizeof(in), out, sizeof(out));
+ if (ret)
+ return ret;
+
+ uid = MLX5_GET(create_uctx_out, out, uid);
+ mlx5ctl_dbg(mcdev, "allocated uid %u with caps 0x%x\n", uid, cap);
+ return uid;
+}
+
+static void mlx5ctl_release_uid(struct mlx5ctl_dev *mcdev, u16 uid)
+{
+ u32 in[MLX5_ST_SZ_DW(destroy_uctx_in)] = {};
+ struct mlx5_core_dev *mdev = mcdev->mdev;
+ int ret;
+
+ MLX5_SET(destroy_uctx_in, in, opcode, MLX5_CMD_OP_DESTROY_UCTX);
+ MLX5_SET(destroy_uctx_in, in, uid, uid);
+
+ ret = mlx5_cmd_exec_in(mdev, destroy_uctx, in);
+ mlx5ctl_dbg(mcdev, "released uid %u %pe\n", uid, ERR_PTR(ret));
+}
+
+static int mlx5ctl_open_uctx(struct fwctl_uctx *uctx)
+{
+ struct mlx5ctl_uctx *mfd =
+ container_of(uctx, struct mlx5ctl_uctx, uctx);
+ struct mlx5ctl_dev *mcdev =
+ container_of(uctx->fwctl, struct mlx5ctl_dev, fwctl);
+ int uid;
+
+ /*
+ * New FW supports the TOOLS_RESOURCES uid security label
+ * which allows commands to manipulate the global device state.
+ * Otherwise only basic existing RDMA devx privilege are allowed.
+ */
+ if (MLX5_CAP_GEN(mcdev->mdev, uctx_cap) &
+ MLX5_UCTX_OBJECT_CAP_TOOLS_RESOURCES)
+ mfd->uctx_caps |= MLX5_UCTX_OBJECT_CAP_TOOLS_RESOURCES;
+
+ uid = mlx5ctl_alloc_uid(mcdev, mfd->uctx_caps);
+ if (uid < 0)
+ return uid;
+
+ mfd->uctx_uid = uid;
+ return 0;
+}
+
+static void mlx5ctl_close_uctx(struct fwctl_uctx *uctx)
+{
+ struct mlx5ctl_dev *mcdev =
+ container_of(uctx->fwctl, struct mlx5ctl_dev, fwctl);
+ struct mlx5ctl_uctx *mfd =
+ container_of(uctx, struct mlx5ctl_uctx, uctx);
+
+ mlx5ctl_release_uid(mcdev, mfd->uctx_uid);
+}
+
+static void *mlx5ctl_info(struct fwctl_uctx *uctx, size_t *length)
+{
+ struct mlx5ctl_uctx *mfd =
+ container_of(uctx, struct mlx5ctl_uctx, uctx);
+ struct fwctl_info_mlx5 *info;
+
+ info = kzalloc(sizeof(*info), GFP_KERNEL);
+ if (!info)
+ return ERR_PTR(-ENOMEM);
+
+ info->uid = mfd->uctx_uid;
+ info->uctx_caps = mfd->uctx_caps;
+ *length = sizeof(*info);
+ return info;
+}
+
+static bool mlx5ctl_validate_rpc(const void *in, enum fwctl_rpc_scope scope)
+{
+ u16 opcode = MLX5_GET(mbox_in_hdr, in, opcode);
+
+ /*
+ * Currently the driver can't keep track of commands that allocate
+ * objects in the FW, these commands are safe from a security
+ * perspective but nothing will free the memory when the FD is closed.
+ * For now permit only query commands. Also the caps for the scope have
+ * not been defined yet, filter commands manually for now.
+ */
+ switch (opcode) {
+ case MLX5_CMD_OP_POSTPONE_CONNECTED_QP_TIMEOUT:
+ case MLX5_CMD_OP_QUERY_ADAPTER:
+ case MLX5_CMD_OP_QUERY_ESW_FUNCTIONS:
+ case MLX5_CMD_OP_QUERY_HCA_CAP:
+ case MLX5_CMD_OP_QUERY_HCA_VPORT_CONTEXT:
+ case MLX5_CMD_OP_QUERY_ROCE_ADDRESS:
+ return scope <= FWCTL_RPC_CONFIGURATION;
+
+ case MLX5_CMD_OP_QUERY_CONG_PARAMS:
+ case MLX5_CMD_OP_QUERY_CONG_STATISTICS:
+ case MLX5_CMD_OP_QUERY_CONG_STATUS:
+ case MLX5_CMD_OP_QUERY_CQ:
+ case MLX5_CMD_OP_QUERY_DCT:
+ case MLX5_CMD_OP_QUERY_DIAGNOSTIC_COUNTERS:
+ case MLX5_CMD_OP_QUERY_DIAGNOSTIC_PARAMS:
+ case MLX5_CMD_OP_QUERY_EQ:
+ case MLX5_CMD_OP_QUERY_ESW_VPORT_CONTEXT:
+ case MLX5_CMD_OP_QUERY_FLOW_COUNTER:
+ case MLX5_CMD_OP_QUERY_FLOW_GROUP:
+ case MLX5_CMD_OP_QUERY_FLOW_TABLE_ENTRY:
+ case MLX5_CMD_OP_QUERY_FLOW_TABLE:
+ case MLX5_CMD_OP_QUERY_GENERAL_OBJECT:
+ case MLX5_CMD_OP_QUERY_ISSI:
+ case MLX5_CMD_OP_QUERY_L2_TABLE_ENTRY:
+ case MLX5_CMD_OP_QUERY_LAG:
+ case MLX5_CMD_OP_QUERY_MAD_DEMUX:
+ case MLX5_CMD_OP_QUERY_MKEY:
+ case MLX5_CMD_OP_QUERY_MODIFY_HEADER_CONTEXT:
+ case MLX5_CMD_OP_QUERY_PACKET_REFORMAT_CONTEXT:
+ case MLX5_CMD_OP_QUERY_PAGES:
+ case MLX5_CMD_OP_QUERY_Q_COUNTER:
+ case MLX5_CMD_OP_QUERY_QP:
+ case MLX5_CMD_OP_QUERY_RMP:
+ case MLX5_CMD_OP_QUERY_RQ:
+ case MLX5_CMD_OP_QUERY_RQT:
+ case MLX5_CMD_OP_QUERY_SCHEDULING_ELEMENT:
+ case MLX5_CMD_OP_QUERY_SPECIAL_CONTEXTS:
+ case MLX5_CMD_OP_QUERY_SQ:
+ case MLX5_CMD_OP_QUERY_SRQ:
+ case MLX5_CMD_OP_QUERY_TIR:
+ case MLX5_CMD_OP_QUERY_TIS:
+ case MLX5_CMD_OP_QUERY_VHCA_MIGRATION_STATE:
+ case MLX5_CMD_OP_QUERY_VNIC_ENV:
+ case MLX5_CMD_OP_QUERY_VPORT_COUNTER:
+ case MLX5_CMD_OP_QUERY_VPORT_STATE:
+ case MLX5_CMD_OP_QUERY_WOL_ROL:
+ case MLX5_CMD_OP_QUERY_XRC_SRQ:
+ case MLX5_CMD_OP_QUERY_XRQ_DC_PARAMS_ENTRY:
+ case MLX5_CMD_OP_QUERY_XRQ_ERROR_PARAMS:
+ case MLX5_CMD_OP_QUERY_XRQ:
+ return scope <= FWCTL_RPC_DEBUG_READ_ONLY;
+
+ case MLX5_CMD_OP_SET_DIAGNOSTIC_PARAMS:
+ return scope <= FWCTL_RPC_DEBUG_WRITE;
+
+ case MLX5_CMD_OP_ACCESS_REG:
+ return scope <= FWCTL_RPC_DEBUG_WRITE_FULL;
+ default:
+ return false;
+ }
+}
+
+static void *mlx5ctl_fw_rpc(struct fwctl_uctx *uctx, enum fwctl_rpc_scope scope,
+ void *rpc_in, size_t in_len, size_t *out_len)
+{
+ struct mlx5ctl_dev *mcdev =
+ container_of(uctx->fwctl, struct mlx5ctl_dev, fwctl);
+ struct mlx5ctl_uctx *mfd =
+ container_of(uctx, struct mlx5ctl_uctx, uctx);
+ void *rpc_alloc __free(kvfree) = NULL;
+ void *rpc_out;
+ int ret;
+
+ if (in_len < MLX5_ST_SZ_BYTES(mbox_in_hdr) ||
+ *out_len < MLX5_ST_SZ_BYTES(mbox_out_hdr))
+ return ERR_PTR(-EMSGSIZE);
+
+ /* FIXME: Requires device support for more scopes */
+ if (scope != FWCTL_RPC_CONFIGURATION &&
+ scope != FWCTL_RPC_DEBUG_READ_ONLY)
+ return ERR_PTR(-EOPNOTSUPP);
+
+ mlx5ctl_dbg(mcdev, "[UID %d] cmdif: opcode 0x%x inlen %zu outlen %zu\n",
+ mfd->uctx_uid, MLX5_GET(mbox_in_hdr, rpc_in, opcode),
+ in_len, *out_len);
+
+ if (!mlx5ctl_validate_rpc(rpc_in, scope))
+ return ERR_PTR(-EBADMSG);
+
+ /*
+ * mlx5_cmd_do() copies the input message to its own buffer before
+ * executing it, so we can reuse the allocation for the output.
+ */
+ if (*out_len <= in_len) {
+ rpc_out = rpc_in;
+ } else {
+ rpc_out = rpc_alloc = kvzalloc(*out_len, GFP_KERNEL);
+ if (!rpc_alloc)
+ return ERR_PTR(-ENOMEM);
+ }
+
+ /* Enforce the user context for the command */
+ MLX5_SET(mbox_in_hdr, rpc_in, uid, mfd->uctx_uid);
+ ret = mlx5_cmd_do(mcdev->mdev, rpc_in, in_len, rpc_out, *out_len);
+
+ mlx5ctl_dbg(mcdev,
+ "[UID %d] cmdif: opcode 0x%x status 0x%x retval %pe\n",
+ mfd->uctx_uid, MLX5_GET(mbox_in_hdr, rpc_in, opcode),
+ MLX5_GET(mbox_out_hdr, rpc_out, status), ERR_PTR(ret));
+
+ /*
+ * -EREMOTEIO means execution succeeded and the out is valid,
+ * but an error code was returned inside out. Everything else
+ * means the RPC did not make it to the device.
+ */
+ if (ret && ret != -EREMOTEIO)
+ return ERR_PTR(ret);
+ if (rpc_out == rpc_in)
+ return rpc_in;
+ return_ptr(rpc_alloc);
+}
+
+static const struct fwctl_ops mlx5ctl_ops = {
+ .device_type = FWCTL_DEVICE_TYPE_MLX5,
+ .uctx_size = sizeof(struct mlx5ctl_uctx),
+ .open_uctx = mlx5ctl_open_uctx,
+ .close_uctx = mlx5ctl_close_uctx,
+ .info = mlx5ctl_info,
+ .fw_rpc = mlx5ctl_fw_rpc,
+};
+
+static int mlx5ctl_probe(struct auxiliary_device *adev,
+ const struct auxiliary_device_id *id)
+
+{
+ struct mlx5_adev *madev = container_of(adev, struct mlx5_adev, adev);
+ struct mlx5_core_dev *mdev = madev->mdev;
+ struct mlx5ctl_dev *mcdev __free(mlx5ctl) = fwctl_alloc_device(
+ &mdev->pdev->dev, &mlx5ctl_ops, struct mlx5ctl_dev, fwctl);
+ int ret;
+
+ if (!mcdev)
+ return -ENOMEM;
+
+ mcdev->mdev = mdev;
+
+ ret = fwctl_register(&mcdev->fwctl);
+ if (ret)
+ return ret;
+ auxiliary_set_drvdata(adev, no_free_ptr(mcdev));
+ return 0;
+}
+
+static void mlx5ctl_remove(struct auxiliary_device *adev)
+{
+ struct mlx5ctl_dev *mcdev __free(mlx5ctl) = auxiliary_get_drvdata(adev);
+
+ fwctl_unregister(&mcdev->fwctl);
+}
+
+static const struct auxiliary_device_id mlx5ctl_id_table[] = {
+ {.name = MLX5_ADEV_NAME ".fwctl",},
+ {},
+};
+MODULE_DEVICE_TABLE(auxiliary, mlx5ctl_id_table);
+
+static struct auxiliary_driver mlx5ctl_driver = {
+ .name = "mlx5_fwctl",
+ .probe = mlx5ctl_probe,
+ .remove = mlx5ctl_remove,
+ .id_table = mlx5ctl_id_table,
+};
+
+module_auxiliary_driver(mlx5ctl_driver);
+
+MODULE_IMPORT_NS(FWCTL);
+MODULE_DESCRIPTION("mlx5 ConnectX fwctl driver");
+MODULE_AUTHOR("Saeed Mahameed <saeedm@nvidia.com>");
+MODULE_LICENSE("Dual BSD/GPL");
diff --git a/include/uapi/fwctl/fwctl.h b/include/uapi/fwctl/fwctl.h
index 8bde0d4416fd55..49a357e1bc713f 100644
--- a/include/uapi/fwctl/fwctl.h
+++ b/include/uapi/fwctl/fwctl.h
@@ -42,6 +42,7 @@ enum {
enum fwctl_device_type {
FWCTL_DEVICE_TYPE_ERROR = 0,
+ FWCTL_DEVICE_TYPE_MLX5 = 1,
};
/**
diff --git a/include/uapi/fwctl/mlx5.h b/include/uapi/fwctl/mlx5.h
new file mode 100644
index 00000000000000..bcb4602ffdeee4
--- /dev/null
+++ b/include/uapi/fwctl/mlx5.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+/*
+ * Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES
+ *
+ * These are definitions for the command interface for mlx5 HW. mlx5 FW has a
+ * User Context mechanism which allows the FW to understand a security scope.
+ * FWCTL binds each FD to a FW user context and then places the User Context ID
+ * (UID) in each command header. The created User Context has a capability set
+ * that is appropriate for FWCTL's security model.
+ *
+ * Command formation should use a copy of the structs in mlx5_ifc.h following
+ * the Programmers Reference Manual. A open release is available here:
+ *
+ * https://network.nvidia.com/files/doc-2020/ethernet-adapters-programming-manual.pdf
+ *
+ * The device_type for this file is FWCTL_DEVICE_TYPE_MLX5.
+ */
+#ifndef _UAPI_FWCTL_MLX5_H
+#define _UAPI_FWCTL_MLX5_H
+
+#include <linux/types.h>
+
+/**
+ * struct fwctl_info_mlx5 - ioctl(FWCTL_INFO) out_device_data
+ * @uid: The FW UID this FD is bound to. Each command header will force
+ * this value.
+ * @uctx_caps: The FW capabilities that are enabled for the uid.
+ *
+ * Return basic information about the FW interface available.
+ */
+struct fwctl_info_mlx5 {
+ __u32 uid;
+ __u32 uctx_caps;
+};
+
+#endif
--
2.45.2
^ permalink raw reply related [flat|nested] 48+ messages in thread* Re: [PATCH v2 7/8] fwctl/mlx5: Support for communicating with mlx5 fw
2024-06-24 22:47 ` [PATCH v2 7/8] fwctl/mlx5: Support for communicating with mlx5 fw Jason Gunthorpe
@ 2024-07-26 16:10 ` Jonathan Cameron
2024-07-29 16:22 ` Jason Gunthorpe
0 siblings, 1 reply; 48+ messages in thread
From: Jonathan Cameron @ 2024-07-26 16:10 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Mon, 24 Jun 2024 19:47:31 -0300
Jason Gunthorpe <jgg@nvidia.com> wrote:
> From: Saeed Mahameed <saeedm@nvidia.com>
>
> mlx5's fw has long provided a User Context concept. This has a long
> history in RDMA as part of the devx extended verbs programming
> interface. A User Context is a security envelope that contains objects and
> controls access. It contains the Protection Domain object from the
> InfiniBand Architecture and both togther provide the OS with the necessary
> tools to bind a security context like a process to the device.
>
> The security context is restricted to not be able to touch the kernel or
> other processes. In the RDMA verbs case it is also restricted to not touch
> global device resources.
>
> The fwctl_mlx5 takes this approach and builds a User Context per fwctl
> file descriptor and uses a FW security capability on the User Context to
> enable access to global device resources. This makes the context useful
> for provisioning and debugging the global device state.
>
> mlx5 already has a robust infrastructure for delivering RPC messages to
> fw. Trivially connect fwctl's RPC mechanism to mlx5_cmd_do(). Enforce the
> User Context ID in every RPC header so the FW knows the security context
> of the issuing ID.
>
> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
A few minor comments + a reference counting question.
> diff --git a/drivers/fwctl/Kconfig b/drivers/fwctl/Kconfig
> index 37147a695add9a..e5ee2d46d43126 100644
> --- a/drivers/fwctl/Kconfig
> +++ b/drivers/fwctl/Kconfig
> @@ -7,3 +7,17 @@ menuconfig FWCTL
> support a wide range of lockdown compatible device behaviors including
> manipulating device FLASH, debugging, and other activities that don't
> fit neatly into an existing subsystem.
> +
> +if FWCTL
Why not use depends on FWCTL?
> +config FWCTL_MLX5
> + tristate "mlx5 ConnectX control fwctl driver"
> + depends on MLX5_CORE
> + help
> + MLX5CTL provides interface for the user process to access the debug and
> + configuration registers of the ConnectX hardware family
> + (NICs, PCI switches and SmartNIC SoCs).
> + This will allow configuration and debug tools to work out of the box on
> + mainstream kernel.
> +
> + If you don't know what to do here, say N.
> +endif
> diff --git a/drivers/fwctl/mlx5/main.c b/drivers/fwctl/mlx5/main.c
> new file mode 100644
> index 00000000000000..5e64371d7e5508
> --- /dev/null
> +++ b/drivers/fwctl/mlx5/main.c
> +static void mlx5ctl_remove(struct auxiliary_device *adev)
> +{
> + struct mlx5ctl_dev *mcdev __free(mlx5ctl) = auxiliary_get_drvdata(adev);
So this is calling fwctl_put(&mcdev->fwctl) on scope exit.
Why do you need to drop a reference beyond the one fwctl_unregister() is dropping
in cdev_device_del()? Where am I missing a reference get?
> +
> + fwctl_unregister(&mcdev->fwctl);
> +}
> +
> +static const struct auxiliary_device_id mlx5ctl_id_table[] = {
> + {.name = MLX5_ADEV_NAME ".fwctl",},
> + {},
No point in comma after terminating entries
> +};
> +MODULE_DEVICE_TABLE(auxiliary, mlx5ctl_id_table);
> +
> +static struct auxiliary_driver mlx5ctl_driver = {
> + .name = "mlx5_fwctl",
> + .probe = mlx5ctl_probe,
> + .remove = mlx5ctl_remove,
> + .id_table = mlx5ctl_id_table,
> +};
> +
> +module_auxiliary_driver(mlx5ctl_driver);
> +
> +MODULE_IMPORT_NS(FWCTL);
> +MODULE_DESCRIPTION("mlx5 ConnectX fwctl driver");
> +MODULE_AUTHOR("Saeed Mahameed <saeedm@nvidia.com>");
> +MODULE_LICENSE("Dual BSD/GPL");
> +#endif
^ permalink raw reply [flat|nested] 48+ messages in thread* Re: [PATCH v2 7/8] fwctl/mlx5: Support for communicating with mlx5 fw
2024-07-26 16:10 ` Jonathan Cameron
@ 2024-07-29 16:22 ` Jason Gunthorpe
2024-07-31 11:52 ` Jonathan Cameron
0 siblings, 1 reply; 48+ messages in thread
From: Jason Gunthorpe @ 2024-07-29 16:22 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Fri, Jul 26, 2024 at 05:10:13PM +0100, Jonathan Cameron wrote:
> > diff --git a/drivers/fwctl/Kconfig b/drivers/fwctl/Kconfig
> > index 37147a695add9a..e5ee2d46d43126 100644
> > --- a/drivers/fwctl/Kconfig
> > +++ b/drivers/fwctl/Kconfig
> > @@ -7,3 +7,17 @@ menuconfig FWCTL
> > support a wide range of lockdown compatible device behaviors including
> > manipulating device FLASH, debugging, and other activities that don't
> > fit neatly into an existing subsystem.
> > +
> > +if FWCTL
>
> Why not use depends on FWCTL?
This is a "safer" pattern for kconfig if you expect a list of
drivers. You put all the driver kconfig stanza's within the above if
and then they all pick it up correctly and consistently. Otherwise you
have to replicate the depends line.
> > +static void mlx5ctl_remove(struct auxiliary_device *adev)
> > +{
> > + struct mlx5ctl_dev *mcdev __free(mlx5ctl) = auxiliary_get_drvdata(adev);
>
> So this is calling fwctl_put(&mcdev->fwctl) on scope exit.
>
> Why do you need to drop a reference beyond the one fwctl_unregister() is dropping
> in cdev_device_del()? Where am I missing a reference get?
fwctl_register() / fwctl_unregister() are pairs. Internally they pair
cdev_device_add() / cdev_device_del() which decrease some internal
cdev refcounts.
_alloc_device() / __free(mlx5ctl) above are the other pair.
device_initialize() holds a reference from probe to remove.
It has to work this way because if cdev_device_del() would put back
all the references we would immediately UAF, eg:
cdev_device_del(&fwctl->cdev, &fwctl->dev);
/* Disable and free the driver's resources for any still open FDs. */
guard(rwsem_write)(&fwctl->registration_lock);
guard(mutex)(&fwctl->uctx_list_lock);
^^^^^^^
Must still be allocated
And more broadly, though mlx5 does not use this, it would be safe for
a driver to do:
fwctl_unregister();
kfree(mcdev->mymemory);
^^^^^^ Must still be allocated!
fwctl_put(&mcdev->fwctl);
So we have the two steps where unregister makes it safe for the driver
to begin teardown but keeps memory around, and the final put which
releases the memory after driver teardown is done.
This is also captured in the cleanup.h notation:
struct mlx5ctl_dev *mcdev __free(mlx5ctl) = fwctl_alloc_device(
&mdev->pdev->dev, &mlx5ctl_ops, struct mlx5ctl_dev,
fwctl);
^^^^^^^^^^^^
Here we indicate we have a ref on the stack from
fwctl_alloc_device
auxiliary_set_drvdata(adev, no_free_ptr(mcdev));
^^^^^^^^^^^^^^^^^ Move the ref
into drvdata
struct mlx5ctl_dev *mcdev __free(mlx5ctl) = auxiliary_get_drvdata(adev);
^^^^^^^^^^^ Move the ref out of
drvdata onto the stack
> > +static const struct auxiliary_device_id mlx5ctl_id_table[] = {
> > + {.name = MLX5_ADEV_NAME ".fwctl",},
> > + {},
>
> No point in comma after terminating entries
Sure
Jason
^ permalink raw reply [flat|nested] 48+ messages in thread* Re: [PATCH v2 7/8] fwctl/mlx5: Support for communicating with mlx5 fw
2024-07-29 16:22 ` Jason Gunthorpe
@ 2024-07-31 11:52 ` Jonathan Cameron
2024-08-01 13:25 ` Jason Gunthorpe
0 siblings, 1 reply; 48+ messages in thread
From: Jonathan Cameron @ 2024-07-31 11:52 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
> > > +static void mlx5ctl_remove(struct auxiliary_device *adev)
> > > +{
> > > + struct mlx5ctl_dev *mcdev __free(mlx5ctl) = auxiliary_get_drvdata(adev);
> >
> > So this is calling fwctl_put(&mcdev->fwctl) on scope exit.
> >
> > Why do you need to drop a reference beyond the one fwctl_unregister() is dropping
> > in cdev_device_del()? Where am I missing a reference get?
>
> fwctl_register() / fwctl_unregister() are pairs. Internally they pair
> cdev_device_add() / cdev_device_del() which decrease some internal
> cdev refcounts.
>
> _alloc_device() / __free(mlx5ctl) above are the other pair.
> device_initialize() holds a reference from probe to remove.
>
> It has to work this way because if cdev_device_del() would put back
> all the references we would immediately UAF, eg:
>
> cdev_device_del(&fwctl->cdev, &fwctl->dev);
>
> /* Disable and free the driver's resources for any still open FDs. */
> guard(rwsem_write)(&fwctl->registration_lock);
> guard(mutex)(&fwctl->uctx_list_lock);
> ^^^^^^^
> Must still be allocated
>
> And more broadly, though mlx5 does not use this, it would be safe for
> a driver to do:
>
> fwctl_unregister();
> kfree(mcdev->mymemory);
> ^^^^^^ Must still be allocated!
> fwctl_put(&mcdev->fwctl);
>
> So we have the two steps where unregister makes it safe for the driver
> to begin teardown but keeps memory around, and the final put which
> releases the memory after driver teardown is done.
>
> This is also captured in the cleanup.h notation:
>
> struct mlx5ctl_dev *mcdev __free(mlx5ctl) = fwctl_alloc_device(
> &mdev->pdev->dev, &mlx5ctl_ops, struct mlx5ctl_dev,
> fwctl);
> ^^^^^^^^^^^^
> Here we indicate we have a ref on the stack from
> fwctl_alloc_device
>
> auxiliary_set_drvdata(adev, no_free_ptr(mcdev));
> ^^^^^^^^^^^^^^^^^ Move the ref
> into drvdata
>
> struct mlx5ctl_dev *mcdev __free(mlx5ctl) = auxiliary_get_drvdata(adev);
> ^^^^^^^^^^^ Move the ref out of
> drvdata onto the stack
>
Thanks for the explanation. I clearly needed more coffee that day :)
Personally I find this to be a confusing use of scoped cleanup
as we aren't associating a constructor / destructor with scope, but
rather sort of 'adopting ownership / destructor'.
Assuming my caffeine level is better today, maybe device managed is
more appropriate?
devm_add_action_or_reset to associate the destructor by placing
it immediately after the setup path for both the allocate and unregister.
Should run in very nearly same order for teardown as what you have here.
Alternatively this is just a new pattern I should get used to.
Jonathan
^ permalink raw reply [flat|nested] 48+ messages in thread* Re: [PATCH v2 7/8] fwctl/mlx5: Support for communicating with mlx5 fw
2024-07-31 11:52 ` Jonathan Cameron
@ 2024-08-01 13:25 ` Jason Gunthorpe
0 siblings, 0 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2024-08-01 13:25 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan, Andy Gospodarek, Aron Silverton, Dan Williams,
David Ahern, Christoph Hellwig, Jiri Pirko, Leonid Bloch,
Leon Romanovsky, linux-cxl, patches
On Wed, Jul 31, 2024 at 12:52:32PM +0100, Jonathan Cameron wrote:
> Thanks for the explanation. I clearly needed more coffee that day :)
> Personally I find this to be a confusing use of scoped cleanup
> as we aren't associating a constructor / destructor with scope, but
> rather sort of 'adopting ownership / destructor'.
It is a pretty typical "move" concept for reference counting, as you
might see in other languages.
> Assuming my caffeine level is better today, maybe device managed is
> more appropriate?
Oh, perhaps controversial, but I dislike devm, so I'd rather not :) It
makes exciting bugs, IMHO. Someone (Laurent?) gave a nice presentation
on some of its nasty edge cases at LPC a few years ago.
> devm_add_action_or_reset to associate the destructor by placing
> it immediately after the setup path for both the allocate and unregister.
> Should run in very nearly same order for teardown as what you have here.
Yes, in this case it would work technically.
I think devm would be more code lines, more memory usage, and more
failure cases though.
So, I'm interested that cleanup could be a better option. I view it as
positive that the success path remove() flow is actually documented
what order it is doing things. Order is very important there and I've
seen enough places get things wrong here over the years..
One of the nasty traps of devm is you have to know to have your
creation order match your required destruction order, and *everything*
has to use devm or it can't be ordered.
Mind you cleanup has the same order trap, but you don't have to use
cleanup during remove(), and maybe it was overzealous to do so here.
Thanks,
Jason
^ permalink raw reply [flat|nested] 48+ messages in thread
* [PATCH v2 8/8] mlx5: Create an auxiliary device for fwctl_mlx5
2024-06-24 22:47 [PATCH v2 0/8] Introduce fwctl subystem Jason Gunthorpe
` (6 preceding siblings ...)
2024-06-24 22:47 ` [PATCH v2 7/8] fwctl/mlx5: Support for communicating with mlx5 fw Jason Gunthorpe
@ 2024-06-24 22:47 ` Jason Gunthorpe
2024-06-24 23:18 ` [PATCH v2 0/8] Introduce fwctl subystem Jakub Kicinski
8 siblings, 0 replies; 48+ messages in thread
From: Jason Gunthorpe @ 2024-06-24 22:47 UTC (permalink / raw)
To: Jonathan Corbet, Itay Avraham, Jakub Kicinski, Leon Romanovsky,
linux-doc, linux-rdma, netdev, Paolo Abeni, Saeed Mahameed,
Tariq Toukan
Cc: Andy Gospodarek, Aron Silverton, Dan Williams, David Ahern,
Christoph Hellwig, Jiri Pirko, Leonid Bloch, Leon Romanovsky,
linux-cxl, patches
From: Saeed Mahameed <saeedm@nvidia.com>
If the device supports User Context then it can support fwctl. Create an
auxiliary device to allow fwctl to bind to it.
Create a sysfs like:
$ ls /sys/devices/pci0000:00/0000:00:0a.0/mlx5_core.fwctl.0/driver -l
lrwxrwxrwx 1 root root 0 Apr 25 19:46 /sys/devices/pci0000:00/0000:00:0a.0/mlx5_core.fwctl.0/driver -> ../../../../bus/auxiliary/drivers/mlx5_fwctl.mlx5_fwctl
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
drivers/net/ethernet/mellanox/mlx5/core/dev.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/dev.c b/drivers/net/ethernet/mellanox/mlx5/core/dev.c
index 47e7c2639774fd..6781ddb090c475 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/dev.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/dev.c
@@ -228,8 +228,14 @@ enum {
MLX5_INTERFACE_PROTOCOL_VNET,
MLX5_INTERFACE_PROTOCOL_DPLL,
+ MLX5_INTERFACE_PROTOCOL_FWCTL,
};
+static bool is_fwctl_supported(struct mlx5_core_dev *dev)
+{
+ return MLX5_CAP_GEN(dev, uctx_cap);
+}
+
static const struct mlx5_adev_device {
const char *suffix;
bool (*is_supported)(struct mlx5_core_dev *dev);
@@ -252,6 +258,8 @@ static const struct mlx5_adev_device {
.is_supported = &is_mp_supported },
[MLX5_INTERFACE_PROTOCOL_DPLL] = { .suffix = "dpll",
.is_supported = &is_dpll_supported },
+ [MLX5_INTERFACE_PROTOCOL_FWCTL] = { .suffix = "fwctl",
+ .is_supported = &is_fwctl_supported },
};
int mlx5_adev_idx_alloc(void)
--
2.45.2
^ permalink raw reply related [flat|nested] 48+ messages in thread* Re: [PATCH v2 0/8] Introduce fwctl subystem
2024-06-24 22:47 [PATCH v2 0/8] Introduce fwctl subystem Jason Gunthorpe
` (7 preceding siblings ...)
2024-06-24 22:47 ` [PATCH v2 8/8] mlx5: Create an auxiliary device for fwctl_mlx5 Jason Gunthorpe
@ 2024-06-24 23:18 ` Jakub Kicinski
8 siblings, 0 replies; 48+ messages in thread
From: Jakub Kicinski @ 2024-06-24 23:18 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jonathan Corbet, Itay Avraham, Leon Romanovsky, linux-doc,
linux-rdma, netdev, Paolo Abeni, Saeed Mahameed, Tariq Toukan,
Andy Gospodarek, Aron Silverton, Dan Williams, David Ahern,
Christoph Hellwig, Jiri Pirko, Leonid Bloch, Leon Romanovsky,
linux-cxl, patches
On Mon, 24 Jun 2024 19:47:24 -0300 Jason Gunthorpe wrote:
> fwctl is a new subsystem intended to bring some common rules and order to
> the growing pattern of exposing a secure FW interface directly to
> userspace. Unlike existing places like RDMA/DRM/VFIO/uacce that are
> exposing a device for datapath operations fwctl is focused on debugging,
> configuration and provisioning of the device. It will not have the
> necessary features like interrupt delivery to support a datapath.
>
> This concept is similar to the long standing practice in the "HW" RAID
> space of having a device specific misc device to manager the RAID
> controller FW. fwctl generalizes this notion of a companion debug and
> management interface that goes along with a dataplane implemented in an
> appropriate subsystem.
>
> The need for this has reached a critical point as many users are moving to
> run lockdown enabled kernels. Several existing devices have had long
> standing tooling for management that relied on /sys/../resource0 or PCI
> config space access which is not permitted in lockdown. A major point of
> fwctl is to define and document the rules that a device must follow to
> expose a lockdown compatible RPC.
>
> Based on some discussion fwctl splits the RPCs into four categories
>
> FWCTL_RPC_CONFIGURATION
> FWCTL_RPC_DEBUG_READ_ONLY
> FWCTL_RPC_DEBUG_WRITE
> FWCTL_RPC_DEBUG_WRITE_FULL
Nacked-by: Jakub Kicinski <kuba@kernel.org>
^ permalink raw reply [flat|nested] 48+ messages in thread