* [PATCH v2 5/5] driver core: remove driver_set_override()
From: Danilo Krummrich @ 2026-05-05 13:37 UTC (permalink / raw)
To: gregkh, rafael, linux, nipun.gupta, nikhil.agarwal, kys, haiyangz,
wei.liu, decui, longli, andersson, mathieu.poirier
Cc: driver-core, linux-kernel, linux-hyperv, linux-arm-msm,
linux-remoteproc, Danilo Krummrich
In-Reply-To: <20260505133935.3772495-1-dakr@kernel.org>
All buses have been converted from driver_set_override() to the generic
driver_override infrastructure introduced in commit cb3d1049f4ea
("driver core: generalize driver_override in struct device").
Buses now either opt into the generic sysfs callbacks via the
bus_type::driver_override flag, or use device_set_driver_override() /
__device_set_driver_override() directly.
Thus, remove the now-unused driver_set_override() helper.
Link: https://bugzilla.kernel.org/show_bug.cgi?id=220789
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
---
drivers/base/driver.c | 75 -----------------------------------
include/linux/device/driver.h | 2 -
2 files changed, 77 deletions(-)
diff --git a/drivers/base/driver.c b/drivers/base/driver.c
index 8ab010ddf709..7ed834f7199c 100644
--- a/drivers/base/driver.c
+++ b/drivers/base/driver.c
@@ -30,81 +30,6 @@ static struct device *next_device(struct klist_iter *i)
return dev;
}
-/**
- * driver_set_override() - Helper to set or clear driver override.
- * @dev: Device to change
- * @override: Address of string to change (e.g. &device->driver_override);
- * The contents will be freed and hold newly allocated override.
- * @s: NUL-terminated string, new driver name to force a match, pass empty
- * string to clear it ("" or "\n", where the latter is only for sysfs
- * interface).
- * @len: length of @s
- *
- * Helper to set or clear driver override in a device, intended for the cases
- * when the driver_override field is allocated by driver/bus code.
- *
- * Returns: 0 on success or a negative error code on failure.
- */
-int driver_set_override(struct device *dev, const char **override,
- const char *s, size_t len)
-{
- const char *new, *old;
- char *cp;
-
- if (!override || !s)
- return -EINVAL;
-
- /*
- * The stored value will be used in sysfs show callback (sysfs_emit()),
- * which has a length limit of PAGE_SIZE and adds a trailing newline.
- * Thus we can store one character less to avoid truncation during sysfs
- * show.
- */
- if (len >= (PAGE_SIZE - 1))
- return -EINVAL;
-
- /*
- * Compute the real length of the string in case userspace sends us a
- * bunch of \0 characters like python likes to do.
- */
- len = strlen(s);
-
- if (!len) {
- /* Empty string passed - clear override */
- device_lock(dev);
- old = *override;
- *override = NULL;
- device_unlock(dev);
- kfree(old);
-
- return 0;
- }
-
- cp = strnchr(s, len, '\n');
- if (cp)
- len = cp - s;
-
- new = kstrndup(s, len, GFP_KERNEL);
- if (!new)
- return -ENOMEM;
-
- device_lock(dev);
- old = *override;
- if (cp != s) {
- *override = new;
- } else {
- /* "\n" passed - clear override */
- kfree(new);
- *override = NULL;
- }
- device_unlock(dev);
-
- kfree(old);
-
- return 0;
-}
-EXPORT_SYMBOL_GPL(driver_set_override);
-
/**
* driver_for_each_device - Iterator for devices bound to a driver.
* @drv: Driver we're iterating.
diff --git a/include/linux/device/driver.h b/include/linux/device/driver.h
index bbc67ec513ed..aa3465a369f0 100644
--- a/include/linux/device/driver.h
+++ b/include/linux/device/driver.h
@@ -160,8 +160,6 @@ int __must_check driver_create_file(const struct device_driver *driver,
void driver_remove_file(const struct device_driver *driver,
const struct driver_attribute *attr);
-int driver_set_override(struct device *dev, const char **override,
- const char *s, size_t len);
int __must_check driver_for_each_device(struct device_driver *drv, struct device *start,
void *data, device_iter_t fn);
struct device *driver_find_device(const struct device_driver *drv,
--
2.54.0
^ permalink raw reply related
* [PATCH v2 4/5] rpmsg: use generic driver_override infrastructure
From: Danilo Krummrich @ 2026-05-05 13:37 UTC (permalink / raw)
To: gregkh, rafael, linux, nipun.gupta, nikhil.agarwal, kys, haiyangz,
wei.liu, decui, longli, andersson, mathieu.poirier
Cc: driver-core, linux-kernel, linux-hyperv, linux-arm-msm,
linux-remoteproc, Danilo Krummrich, Gui-Dong Han
In-Reply-To: <20260505133935.3772495-1-dakr@kernel.org>
When a driver is probed through __driver_attach(), the bus' match()
callback is called without the device lock held, thus accessing the
driver_override field without a lock, which can cause a UAF.
Fix this by using the driver-core driver_override infrastructure taking
care of proper locking internally.
Note that calling match() from __driver_attach() without the device lock
held is intentional. [1]
Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/ [1]
Reported-by: Gui-Dong Han <hanguidong02@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
Fixes: e95060478244 ("rpmsg: Introduce a driver override mechanism")
Reviewed-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
---
drivers/rpmsg/qcom_glink_native.c | 2 --
drivers/rpmsg/rpmsg_core.c | 43 +++++--------------------------
drivers/rpmsg/virtio_rpmsg_bus.c | 1 -
include/linux/rpmsg.h | 4 ---
4 files changed, 7 insertions(+), 43 deletions(-)
diff --git a/drivers/rpmsg/qcom_glink_native.c b/drivers/rpmsg/qcom_glink_native.c
index 401a4ece0c97..d9d4468e4cbd 100644
--- a/drivers/rpmsg/qcom_glink_native.c
+++ b/drivers/rpmsg/qcom_glink_native.c
@@ -1626,7 +1626,6 @@ static void qcom_glink_rpdev_release(struct device *dev)
{
struct rpmsg_device *rpdev = to_rpmsg_device(dev);
- kfree(rpdev->driver_override);
kfree(rpdev);
}
@@ -1862,7 +1861,6 @@ static void qcom_glink_device_release(struct device *dev)
/* Release qcom_glink_alloc_channel() reference */
kref_put(&channel->refcount, qcom_glink_channel_release);
- kfree(rpdev->driver_override);
kfree(rpdev);
}
diff --git a/drivers/rpmsg/rpmsg_core.c b/drivers/rpmsg/rpmsg_core.c
index e7f7831d37f8..c56f69c22e42 100644
--- a/drivers/rpmsg/rpmsg_core.c
+++ b/drivers/rpmsg/rpmsg_core.c
@@ -358,33 +358,6 @@ rpmsg_show_attr(src, src, "0x%x\n");
rpmsg_show_attr(dst, dst, "0x%x\n");
rpmsg_show_attr(announce, announce ? "true" : "false", "%s\n");
-static ssize_t driver_override_store(struct device *dev,
- struct device_attribute *attr,
- const char *buf, size_t count)
-{
- struct rpmsg_device *rpdev = to_rpmsg_device(dev);
- int ret;
-
- ret = driver_set_override(dev, &rpdev->driver_override, buf, count);
- if (ret)
- return ret;
-
- return count;
-}
-
-static ssize_t driver_override_show(struct device *dev,
- struct device_attribute *attr, char *buf)
-{
- struct rpmsg_device *rpdev = to_rpmsg_device(dev);
- ssize_t len;
-
- device_lock(dev);
- len = sysfs_emit(buf, "%s\n", rpdev->driver_override);
- device_unlock(dev);
- return len;
-}
-static DEVICE_ATTR_RW(driver_override);
-
static ssize_t modalias_show(struct device *dev,
struct device_attribute *attr, char *buf)
{
@@ -405,7 +378,6 @@ static struct attribute *rpmsg_dev_attrs[] = {
&dev_attr_dst.attr,
&dev_attr_src.attr,
&dev_attr_announce.attr,
- &dev_attr_driver_override.attr,
NULL,
};
ATTRIBUTE_GROUPS(rpmsg_dev);
@@ -424,9 +396,11 @@ static int rpmsg_dev_match(struct device *dev, const struct device_driver *drv)
const struct rpmsg_driver *rpdrv = to_rpmsg_driver(drv);
const struct rpmsg_device_id *ids = rpdrv->id_table;
unsigned int i;
+ int ret;
- if (rpdev->driver_override)
- return !strcmp(rpdev->driver_override, drv->name);
+ ret = device_match_driver_override(dev, drv);
+ if (ret >= 0)
+ return ret;
if (ids)
for (i = 0; ids[i].name[0]; i++)
@@ -535,6 +509,7 @@ static const struct bus_type rpmsg_bus = {
.name = "rpmsg",
.match = rpmsg_dev_match,
.dev_groups = rpmsg_dev_groups,
+ .driver_override = true,
.uevent = rpmsg_uevent,
.probe = rpmsg_dev_probe,
.remove = rpmsg_dev_remove,
@@ -560,11 +535,9 @@ int rpmsg_register_device_override(struct rpmsg_device *rpdev,
device_initialize(dev);
if (driver_override) {
- ret = driver_set_override(dev, &rpdev->driver_override,
- driver_override,
- strlen(driver_override));
+ ret = device_set_driver_override(dev, driver_override);
if (ret) {
- dev_err(dev, "device_set_override failed: %d\n", ret);
+ dev_err(dev, "device_set_driver_override() failed: %d\n", ret);
put_device(dev);
return ret;
}
@@ -573,8 +546,6 @@ int rpmsg_register_device_override(struct rpmsg_device *rpdev,
ret = device_add(dev);
if (ret) {
dev_err(dev, "device_add failed: %d\n", ret);
- kfree(rpdev->driver_override);
- rpdev->driver_override = NULL;
put_device(dev);
}
diff --git a/drivers/rpmsg/virtio_rpmsg_bus.c b/drivers/rpmsg/virtio_rpmsg_bus.c
index 5ae15111fb4f..1b8bb05924af 100644
--- a/drivers/rpmsg/virtio_rpmsg_bus.c
+++ b/drivers/rpmsg/virtio_rpmsg_bus.c
@@ -374,7 +374,6 @@ static void virtio_rpmsg_release_device(struct device *dev)
struct rpmsg_device *rpdev = to_rpmsg_device(dev);
struct virtio_rpmsg_channel *vch = to_virtio_rpmsg_channel(rpdev);
- kfree(rpdev->driver_override);
kfree(vch);
}
diff --git a/include/linux/rpmsg.h b/include/linux/rpmsg.h
index 83266ce14642..2e40eb54155e 100644
--- a/include/linux/rpmsg.h
+++ b/include/linux/rpmsg.h
@@ -41,9 +41,6 @@ struct rpmsg_channel_info {
* rpmsg_device - device that belong to the rpmsg bus
* @dev: the device struct
* @id: device id (used to match between rpmsg drivers and devices)
- * @driver_override: driver name to force a match; do not set directly,
- * because core frees it; use driver_set_override() to
- * set or clear it.
* @src: local address
* @dst: destination address
* @ept: the rpmsg endpoint of this channel
@@ -53,7 +50,6 @@ struct rpmsg_channel_info {
struct rpmsg_device {
struct device dev;
struct rpmsg_device_id id;
- const char *driver_override;
u32 src;
u32 dst;
struct rpmsg_endpoint *ept;
--
2.54.0
^ permalink raw reply related
* [PATCH v2 3/5] Drivers: hv: vmbus: use generic driver_override infrastructure
From: Danilo Krummrich @ 2026-05-05 13:37 UTC (permalink / raw)
To: gregkh, rafael, linux, nipun.gupta, nikhil.agarwal, kys, haiyangz,
wei.liu, decui, longli, andersson, mathieu.poirier
Cc: driver-core, linux-kernel, linux-hyperv, linux-arm-msm,
linux-remoteproc, Danilo Krummrich, Michael Kelley, Gui-Dong Han
In-Reply-To: <20260505133935.3772495-1-dakr@kernel.org>
When a driver is probed through __driver_attach(), the bus' match()
callback is called without the device lock held, thus accessing the
driver_override field without a lock, which can cause a UAF.
Fix this by using the driver-core driver_override infrastructure taking
care of proper locking internally.
Note that calling match() from __driver_attach() without the device lock
held is intentional. [1]
Tested-by: Michael Kelley <mhklinux@outlook.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/ [1]
Reported-by: Gui-Dong Han <hanguidong02@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
Fixes: d765edbb301c ("vmbus: add driver_override support")
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
---
drivers/hv/vmbus_drv.c | 43 ++++++++++--------------------------------
include/linux/hyperv.h | 5 -----
2 files changed, 10 insertions(+), 38 deletions(-)
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index d28ff45d4cfd..acfb579828c5 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -538,34 +538,6 @@ static ssize_t device_show(struct device *dev,
}
static DEVICE_ATTR_RO(device);
-static ssize_t driver_override_store(struct device *dev,
- struct device_attribute *attr,
- const char *buf, size_t count)
-{
- struct hv_device *hv_dev = device_to_hv_device(dev);
- int ret;
-
- ret = driver_set_override(dev, &hv_dev->driver_override, buf, count);
- if (ret)
- return ret;
-
- return count;
-}
-
-static ssize_t driver_override_show(struct device *dev,
- struct device_attribute *attr, char *buf)
-{
- struct hv_device *hv_dev = device_to_hv_device(dev);
- ssize_t len;
-
- device_lock(dev);
- len = sysfs_emit(buf, "%s\n", hv_dev->driver_override);
- device_unlock(dev);
-
- return len;
-}
-static DEVICE_ATTR_RW(driver_override);
-
/* Set up per device attributes in /sys/bus/vmbus/devices/<bus device> */
static struct attribute *vmbus_dev_attrs[] = {
&dev_attr_id.attr,
@@ -596,7 +568,6 @@ static struct attribute *vmbus_dev_attrs[] = {
&dev_attr_channel_vp_mapping.attr,
&dev_attr_vendor.attr,
&dev_attr_device.attr,
- &dev_attr_driver_override.attr,
NULL,
};
@@ -708,9 +679,11 @@ static const struct hv_vmbus_device_id *hv_vmbus_get_id(const struct hv_driver *
{
const guid_t *guid = &dev->dev_type;
const struct hv_vmbus_device_id *id;
+ int ret;
- /* When driver_override is set, only bind to the matching driver */
- if (dev->driver_override && strcmp(dev->driver_override, drv->name))
+ /* If a driver override is set, only bind to the matching driver */
+ ret = device_match_driver_override(&dev->device, &drv->driver);
+ if (ret == 0)
return NULL;
/* Look at the dynamic ids first, before the static ones */
@@ -718,8 +691,11 @@ static const struct hv_vmbus_device_id *hv_vmbus_get_id(const struct hv_driver *
if (!id)
id = hv_vmbus_dev_match(drv->id_table, guid);
- /* driver_override will always match, send a dummy id */
- if (!id && dev->driver_override)
+ /*
+ * If there's a matching driver override, this function should succeed,
+ * thus return a dummy device ID if no matching ID is found.
+ */
+ if (!id && ret > 0)
id = &vmbus_device_null;
return id;
@@ -1021,6 +997,7 @@ static const struct dev_pm_ops vmbus_pm = {
/* The one and only one */
static const struct bus_type hv_bus = {
.name = "vmbus",
+ .driver_override = true,
.match = vmbus_match,
.shutdown = vmbus_shutdown,
.remove = vmbus_remove,
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 964f1be8150c..c054d7eff622 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1272,11 +1272,6 @@ struct hv_device {
u16 device_id;
struct device device;
- /*
- * Driver name to force a match. Do not set directly, because core
- * frees it. Use driver_set_override() to set or clear it.
- */
- const char *driver_override;
struct vmbus_channel *channel;
struct kset *channels_kset;
--
2.54.0
^ permalink raw reply related
* [PATCH v2 2/5] cdx: use generic driver_override infrastructure
From: Danilo Krummrich @ 2026-05-05 13:37 UTC (permalink / raw)
To: gregkh, rafael, linux, nipun.gupta, nikhil.agarwal, kys, haiyangz,
wei.liu, decui, longli, andersson, mathieu.poirier
Cc: driver-core, linux-kernel, linux-hyperv, linux-arm-msm,
linux-remoteproc, Danilo Krummrich, Gui-Dong Han
In-Reply-To: <20260505133935.3772495-1-dakr@kernel.org>
When a driver is probed through __driver_attach(), the bus' match()
callback is called without the device lock held, thus accessing the
driver_override field without a lock, which can cause a UAF.
Fix this by using the driver-core driver_override infrastructure taking
care of proper locking internally.
Note that calling match() from __driver_attach() without the device lock
held is intentional. [1]
Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/ [1]
Reported-by: Gui-Dong Han <hanguidong02@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
Fixes: 2959ab247061 ("cdx: add the cdx bus driver")
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
---
drivers/cdx/cdx.c | 40 +++++--------------------------------
include/linux/cdx/cdx_bus.h | 4 ----
2 files changed, 5 insertions(+), 39 deletions(-)
diff --git a/drivers/cdx/cdx.c b/drivers/cdx/cdx.c
index 9196dc50a48d..d3d230247262 100644
--- a/drivers/cdx/cdx.c
+++ b/drivers/cdx/cdx.c
@@ -156,8 +156,6 @@ static int cdx_unregister_device(struct device *dev,
} else {
cdx_destroy_res_attr(cdx_dev, MAX_CDX_DEV_RESOURCES);
debugfs_remove_recursive(cdx_dev->debugfs_dir);
- kfree(cdx_dev->driver_override);
- cdx_dev->driver_override = NULL;
}
/*
@@ -268,6 +266,7 @@ static int cdx_bus_match(struct device *dev, const struct device_driver *drv)
const struct cdx_driver *cdx_drv = to_cdx_driver(drv);
const struct cdx_device_id *found_id = NULL;
const struct cdx_device_id *ids;
+ int ret;
if (cdx_dev->is_bus)
return false;
@@ -275,7 +274,8 @@ static int cdx_bus_match(struct device *dev, const struct device_driver *drv)
ids = cdx_drv->match_id_table;
/* When driver_override is set, only bind to the matching driver */
- if (cdx_dev->driver_override && strcmp(cdx_dev->driver_override, drv->name))
+ ret = device_match_driver_override(dev, drv);
+ if (ret == 0)
return false;
found_id = cdx_match_id(ids, cdx_dev);
@@ -289,7 +289,7 @@ static int cdx_bus_match(struct device *dev, const struct device_driver *drv)
*/
if (!found_id->override_only)
return true;
- if (cdx_dev->driver_override)
+ if (ret > 0)
return true;
ids = found_id + 1;
@@ -453,36 +453,6 @@ static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
}
static DEVICE_ATTR_RO(modalias);
-static ssize_t driver_override_store(struct device *dev,
- struct device_attribute *attr,
- const char *buf, size_t count)
-{
- struct cdx_device *cdx_dev = to_cdx_device(dev);
- int ret;
-
- if (WARN_ON(dev->bus != &cdx_bus_type))
- return -EINVAL;
-
- ret = driver_set_override(dev, &cdx_dev->driver_override, buf, count);
- if (ret)
- return ret;
-
- return count;
-}
-
-static ssize_t driver_override_show(struct device *dev,
- struct device_attribute *attr, char *buf)
-{
- struct cdx_device *cdx_dev = to_cdx_device(dev);
- ssize_t len;
-
- device_lock(dev);
- len = sysfs_emit(buf, "%s\n", cdx_dev->driver_override);
- device_unlock(dev);
- return len;
-}
-static DEVICE_ATTR_RW(driver_override);
-
static ssize_t enable_store(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count)
{
@@ -552,7 +522,6 @@ static struct attribute *cdx_dev_attrs[] = {
&dev_attr_class.attr,
&dev_attr_revision.attr,
&dev_attr_modalias.attr,
- &dev_attr_driver_override.attr,
NULL,
};
@@ -646,6 +615,7 @@ ATTRIBUTE_GROUPS(cdx_bus);
const struct bus_type cdx_bus_type = {
.name = "cdx",
+ .driver_override = true,
.match = cdx_bus_match,
.probe = cdx_probe,
.remove = cdx_remove,
diff --git a/include/linux/cdx/cdx_bus.h b/include/linux/cdx/cdx_bus.h
index b1ba97f6c9ad..f54770f110bc 100644
--- a/include/linux/cdx/cdx_bus.h
+++ b/include/linux/cdx/cdx_bus.h
@@ -137,9 +137,6 @@ struct cdx_controller {
* @enabled: is this bus enabled
* @msi_dev_id: MSI Device ID associated with CDX device
* @num_msi: Number of MSI's supported by the device
- * @driver_override: driver name to force a match; do not set directly,
- * because core frees it; use driver_set_override() to
- * set or clear it.
* @irqchip_lock: lock to synchronize irq/msi configuration
* @msi_write_pending: MSI write pending for this device
*/
@@ -165,7 +162,6 @@ struct cdx_device {
bool enabled;
u32 msi_dev_id;
u32 num_msi;
- const char *driver_override;
struct mutex irqchip_lock;
bool msi_write_pending;
};
--
2.54.0
^ permalink raw reply related
* [PATCH v2 1/5] amba: use generic driver_override infrastructure
From: Danilo Krummrich @ 2026-05-05 13:37 UTC (permalink / raw)
To: gregkh, rafael, linux, nipun.gupta, nikhil.agarwal, kys, haiyangz,
wei.liu, decui, longli, andersson, mathieu.poirier
Cc: driver-core, linux-kernel, linux-hyperv, linux-arm-msm,
linux-remoteproc, Danilo Krummrich, Gui-Dong Han
In-Reply-To: <20260505133935.3772495-1-dakr@kernel.org>
When a driver is probed through __driver_attach(), the bus' match()
callback is called without the device lock held, thus accessing the
driver_override field without a lock, which can cause a UAF.
Fix this by using the driver-core driver_override infrastructure taking
care of proper locking internally.
Note that calling match() from __driver_attach() without the device lock
held is intentional. [1]
Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/ [1]
Reported-by: Gui-Dong Han <hanguidong02@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
Fixes: 3cf385713460 ("ARM: 8256/1: driver coamba: add device binding path 'driver_override'")
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
---
drivers/amba/bus.c | 37 ++++++-------------------------------
include/linux/amba/bus.h | 5 -----
2 files changed, 6 insertions(+), 36 deletions(-)
diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
index 6d479caf89cb..d721d64a9858 100644
--- a/drivers/amba/bus.c
+++ b/drivers/amba/bus.c
@@ -82,33 +82,6 @@ static void amba_put_disable_pclk(struct amba_device *pcdev)
}
-static ssize_t driver_override_show(struct device *_dev,
- struct device_attribute *attr, char *buf)
-{
- struct amba_device *dev = to_amba_device(_dev);
- ssize_t len;
-
- device_lock(_dev);
- len = sprintf(buf, "%s\n", dev->driver_override);
- device_unlock(_dev);
- return len;
-}
-
-static ssize_t driver_override_store(struct device *_dev,
- struct device_attribute *attr,
- const char *buf, size_t count)
-{
- struct amba_device *dev = to_amba_device(_dev);
- int ret;
-
- ret = driver_set_override(_dev, &dev->driver_override, buf, count);
- if (ret)
- return ret;
-
- return count;
-}
-static DEVICE_ATTR_RW(driver_override);
-
#define amba_attr_func(name,fmt,arg...) \
static ssize_t name##_show(struct device *_dev, \
struct device_attribute *attr, char *buf) \
@@ -126,7 +99,6 @@ amba_attr_func(resource, "\t%016llx\t%016llx\t%016lx\n",
static struct attribute *amba_dev_attrs[] = {
&dev_attr_id.attr,
&dev_attr_resource.attr,
- &dev_attr_driver_override.attr,
NULL,
};
ATTRIBUTE_GROUPS(amba_dev);
@@ -209,10 +181,11 @@ static int amba_match(struct device *dev, const struct device_driver *drv)
{
struct amba_device *pcdev = to_amba_device(dev);
const struct amba_driver *pcdrv = to_amba_driver(drv);
+ int ret;
mutex_lock(&pcdev->periphid_lock);
if (!pcdev->periphid) {
- int ret = amba_read_periphid(pcdev);
+ ret = amba_read_periphid(pcdev);
/*
* Returning any error other than -EPROBE_DEFER from bus match
@@ -230,8 +203,9 @@ static int amba_match(struct device *dev, const struct device_driver *drv)
mutex_unlock(&pcdev->periphid_lock);
/* When driver_override is set, only bind to the matching driver */
- if (pcdev->driver_override)
- return !strcmp(pcdev->driver_override, drv->name);
+ ret = device_match_driver_override(dev, drv);
+ if (ret >= 0)
+ return ret;
return amba_lookup(pcdrv->id_table, pcdev) != NULL;
}
@@ -436,6 +410,7 @@ static const struct dev_pm_ops amba_pm = {
const struct bus_type amba_bustype = {
.name = "amba",
.dev_groups = amba_dev_groups,
+ .driver_override = true,
.match = amba_match,
.uevent = amba_uevent,
.probe = amba_probe,
diff --git a/include/linux/amba/bus.h b/include/linux/amba/bus.h
index 9946276aff73..6c54d5c0d21f 100644
--- a/include/linux/amba/bus.h
+++ b/include/linux/amba/bus.h
@@ -71,11 +71,6 @@ struct amba_device {
unsigned int cid;
struct amba_cs_uci_id uci;
unsigned int irq[AMBA_NR_IRQS];
- /*
- * Driver name to force a match. Do not set directly, because core
- * frees it. Use driver_set_override() to set or clear it.
- */
- const char *driver_override;
};
struct amba_driver {
--
2.54.0
^ permalink raw reply related
* [PATCH v2 0/5] treewide: Convert buses to use generic driver_override
From: Danilo Krummrich @ 2026-05-05 13:37 UTC (permalink / raw)
To: gregkh, rafael, linux, nipun.gupta, nikhil.agarwal, kys, haiyangz,
wei.liu, decui, longli, andersson, mathieu.poirier
Cc: driver-core, linux-kernel, linux-hyperv, linux-arm-msm,
linux-remoteproc, Danilo Krummrich
This is the follow-up of the driver_override generalization in [1], converting
the remaining 4 busses and removing the now-unused driver_set_override() helper.
All of them are prone to the potential UAF described in [2], caused by accessing
the driver_override field from their corresponding match() callback.
In order to address this, the generalized driver_override field in struct device
is protected with a spinlock. The driver-core provides accessors, such as
device_match_driver_override(), device_has_driver_override() and
device_set_driver_override(), which all ensure proper locking internally.
Additionally, the driver-core provides a driver_override flag in struct
bus_type, which, once enabled, automatically registers generic sysfs callbacks,
allowing userspace to modify the driver_override field.
This series is based on v7.1-rc1 with no additional dependencies, hence those
patches can be picked up by subsystems individually.
[1] https://lore.kernel.org/driver-core/20260303115720.48783-1-dakr@kernel.org/
[2] https://bugzilla.kernel.org/show_bug.cgi?id=220789
[3] https://gitlab.com/driverctl/driverctl/-/blob/0.121/driverctl?ref_type=tags#L99
Changes in v2:
- Rebase on v7.1-rc1
- Drop already merged patches
- vmbus documentation changes as requested by Michael
Danilo Krummrich (5):
amba: use generic driver_override infrastructure
cdx: use generic driver_override infrastructure
Drivers: hv: vmbus: use generic driver_override infrastructure
rpmsg: use generic driver_override infrastructure
driver core: remove driver_set_override()
drivers/amba/bus.c | 37 +++------------
drivers/base/driver.c | 75 -------------------------------
drivers/cdx/cdx.c | 40 +++--------------
drivers/hv/vmbus_drv.c | 43 +++++-------------
drivers/rpmsg/qcom_glink_native.c | 2 -
drivers/rpmsg/rpmsg_core.c | 43 +++---------------
drivers/rpmsg/virtio_rpmsg_bus.c | 1 -
include/linux/amba/bus.h | 5 ---
include/linux/cdx/cdx_bus.h | 4 --
include/linux/device/driver.h | 2 -
include/linux/hyperv.h | 5 ---
include/linux/rpmsg.h | 4 --
12 files changed, 28 insertions(+), 233 deletions(-)
base-commit: 254f49634ee16a731174d2ae34bc50bd5f45e731
--
2.54.0
^ permalink raw reply
* RE: [PATCH v4 1/3] mshv: limit SynIC management to MSHV-owned resources
From: Jork Loeser @ 2026-05-05 11:35 UTC (permalink / raw)
To: Michael Kelley
Cc: linux-hyperv@vger.kernel.org, x86@kernel.org, K . Y . Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, Dave Hansen, H . Peter Anvin,
Arnd Bergmann, Anirudh Rayabharam, linux-kernel@vger.kernel.org,
linux-arch@vger.kernel.org
In-Reply-To: <SN6PR02MB4157280E305E11C5840B444DD4312@SN6PR02MB4157.namprd02.prod.outlook.com>
On Mon, 4 May 2026, Michael Kelley wrote:
> From: Jork Loeser <jloeser@linux.microsoft.com> Sent: Monday, April 27, 2026 2:39 PM
>>
>> While here, fix the SIEFP and SIRBP memremap() and virt_to_phys()
>> calls to use HV_HYP_PAGE_SHIFT/HV_HYP_PAGE_SIZE instead of
>> PAGE_SHIFT/PAGE_SIZE. The hypervisor always uses 4K pages for SynIC
>> register GPAs regardless of the kernel page size, so using PAGE_SHIFT
>> produces wrong addresses on ARM64 with 64K pages.
>
> I agree that this is a good change. But any kernel image built with
> CONFIG_MSHV_ROOT set must use only 4KiB pages, as enforced
> by the dependency in drivers/hv/Kconfig. The change makes the
> code explicitly match the SynIC register layout, which is good,
> but it doesn't actually fix a problem since root MSHV code can't
> run on ARM64 with 64KiB pages. My only concern is that this
> commit message should not imply that an ARM64/64KiB
> configuration is possible for the root.
Agree for root. For L1VH, I think it theoretically could (likely with
other changes needing to happen elsewhere).
Best,
Jork
^ permalink raw reply
* Re: [PATCH 0/3] net: mana: Fix mana_destroy_rxq() cleanup for partial RXQ init
From: patchwork-bot+netdevbpf @ 2026-05-05 10:20 UTC (permalink / raw)
To: Dipayaan Roy
Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees,
john.fastabend, hawk, bpf, daniel, ast, sdf, yury.norov
In-Reply-To: <20260430035935.1859220-1-dipayanroy@linux.microsoft.com>
Hello:
This series was applied to netdev/net.git (main)
by Paolo Abeni <pabeni@redhat.com>:
On Wed, 29 Apr 2026 20:57:51 -0700 you wrote:
> When mana_create_rxq() fails partway through initialization (e.g. the
> hardware rejects the WQ object creation), the error path calls
> mana_destroy_rxq() to tear down a partially-initialized RXQ.
> This exposed multiple issues in mana_destroy_rxq() path, as it assumed
> the RXQ was always fully initialized, leading to multiple issues:
>
> 1. xdp_rxq_info_unreg() was called on an unregistered xdp_rxq,
> triggering a WARN_ON ("Driver BUG") in net/core/xdp.c.
>
> [...]
Here is the summary with links:
- [1/3] net: mana: check xdp_rxq registration before unreg in mana_destroy_rxq()
https://git.kernel.org/netdev/net/c/e9e334f8063a
- [2/3] net: mana: Skip WQ object destruction for uninitialized RXQ
https://git.kernel.org/netdev/net/c/2a1c69118282
- [3/3] net: mana: remove double CQ cleanup in mana_create_rxq error path
https://git.kernel.org/netdev/net/c/3985c9a56da4
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net v2] net: mana: Optimize irq affinity for low vcpu configs
From: Shradha Gupta @ 2026-05-05 6:15 UTC (permalink / raw)
To: Yury Norov
Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li, Yury Norov,
linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
Saurabh Singh Sengar, stable
In-Reply-To: <afYxOPL4DNjXM7tL@yury>
On Sat, May 02, 2026 at 01:15:36PM -0400, Yury Norov wrote:
> On Sat, May 02, 2026 at 07:37:43AM -0700, Shradha Gupta wrote:
> > On Fri, May 01, 2026 at 12:22:20PM -0400, Yury Norov wrote:
> > > On Wed, Apr 29, 2026 at 02:06:37AM -0700, Shradha Gupta wrote:
> > > > In mana driver, the number of IRQs allocated is capped by the
> > > > min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> > > > than the vcpu count, we want to utilize all the vCPUs, irrespective of
> > > > their NUMA/core bindings.
> > > >
> > > > This is important, especially in the envs where number of vCPUs are so
> > > > few that the softIRQ handling overhead on two IRQs on the same vCPU is
> > > > much more than their overheads if they were spread across sibling vCPUs.
> > > >
> > > > This behaviour is more evident with dynamic IRQ allocation. Since MANA
> > > > IRQs are assigned at a later stage compared to static allocation, other
> > > > device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> > > > weights become imbalanced, causing multiple MANA IRQs to land on the
> > > > same vCPU, while some vCPUs have none.
> > > >
> > > > In such cases when many parallel TCP connections are tested, the
> > > > throughput drops significantly.
> > > >
> > > > Test envs:
> > > > =======================================================
> > > > Case 1: without this patch
> > > > =======================================================
> > > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > >
> > > > TYPE effective vCPU aff
> > > > =======================================================
> > > > IRQ0: HWC 0
> > > > IRQ1: mana_q1 0
> > > > IRQ2: mana_q2 2
> > > > IRQ3: mana_q3 0
> > > > IRQ4: mana_q4 3
> > > >
> > > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > > vCPU 0 1 2 3
> > > > =======================================================
> > > > pass 1: 38.85 0.03 24.89 24.65
> > > > pass 2: 39.15 0.03 24.57 25.28
> > > > pass 3: 40.36 0.03 23.20 23.17
> > > >
> > > > =======================================================
> > > > Case 2: with this patch
> > > > =======================================================
> > > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > >
> > > > TYPE effective vCPU aff
> > > > =======================================================
> > > > IRQ0: HWC 0
> > > > IRQ1: mana_q1 0
> > > > IRQ2: mana_q2 1
> > > > IRQ3: mana_q3 2
> > > > IRQ4: mana_q4 3
> > > >
> > > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > > vCPU 0 1 2 3
> > > > =======================================================
> > > > pass 1: 15.42 15.85 14.99 14.51
> > > > pass 2: 15.53 15.94 15.81 15.93
> > > > pass 3: 16.41 16.35 16.40 16.36
> > > >
> > > > =======================================================
> > > > Throughput Impact(in Gbps, same env)
> > > > =======================================================
> > > > TCP conn with patch w/o patch
> > > > 20480 15.65 7.73
> > > > 10240 15.63 8.93
> > > > 8192 15.64 9.69
> > > > 6144 15.64 13.16
> > > > 4096 15.69 15.75
> > > > 2048 15.69 15.83
> > > > 1024 15.71 15.28
> > > >
> > > > Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> > > > Cc: stable@vger.kernel.org
> > > > Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > > > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > > > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > > > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > > > ---
> > > > Changes in v2
> > > > * Removed the unused skip_first_cpu variable
> > > > * fixed exit condition in irq_setup_linear() with len == 0
> > > > * changed return type of irq_setup_linear() as it will always be 0
> > > > * removed the unnecessary rcu_read_lock() in irq_setup_linear()
> > > > * added appropriate comments to indicate expected behaviour when
> > > > IRQs are more than or equal to num_online_cpus()
> > > > ---
> > > > .../net/ethernet/microsoft/mana/gdma_main.c | 47 ++++++++++++++++---
> > > > 1 file changed, 40 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > index 098fbda0d128..d740d1dc43da 100644
> > > > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > @@ -167,6 +167,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
> > > > } else {
> > > > /* If dynamic allocation is enabled we have already allocated
> > > > * hwc msi
> > > > + * Also, we make sure in this case the following is always true
> > > > + * (num_msix_usable - 1 HWC) <= num_online_cpus()
> > > > */
> > > > gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
> > > > }
> > > > @@ -1672,11 +1674,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> > > > return 0;
> > > > }
> > > >
> > > > +/* should be called with cpus_read_lock() held */
> > > > +static void irq_setup_linear(unsigned int *irqs, unsigned int len)
> > > > +{
> > > > + int cpu;
> > > > +
> > > > + for_each_online_cpu(cpu) {
> > > > + if (len == 0)
> > > > + break;
> > > > +
> > > > + irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> > > > + len--;
> > > > + }
> > > > +}
> > > > +
> > > > static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > > > {
> > > > struct gdma_context *gc = pci_get_drvdata(pdev);
> > > > struct gdma_irq_context *gic;
> > > > - bool skip_first_cpu = false;
> > > > int *irqs, irq, err, i;
> > > >
> > > > irqs = kmalloc_objs(int, nvec);
> > >
> > > So what about WARN_ON() and nvec adjustment before kmalloc?
> > Hey Yury,
> >
> > I am still a bit unsure about the WARN_ON() before kmalloc, as after
> > that also, in the same function till we take the cpus_read_lock() the
> > num_online_cpus() can change(or reduce). That's why I introduced the
> > dev_dbg() to capture hot-remove edge case.
>
> OK.
>
> > Do you still think it adds more value?
>
> It's your driver, so you know better. I just wonder because you said
> it's good to add WARN_ON(), and then didn't do that.
>
> > >
> > > > @@ -1722,13 +1737,31 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > > > * first CPU sibling group since they are already affinitized to HWC IRQ
> > > > */
> > > > cpus_read_lock();
> > > > - if (gc->num_msix_usable <= num_online_cpus())
> > > > - skip_first_cpu = true;
> > > > + if (gc->num_msix_usable <= num_online_cpus()) {
> > > > + err = irq_setup(irqs, nvec, gc->numa_node, true);
> > > > + if (err) {
> > > > + cpus_read_unlock();
> > > > + goto free_irq;
> > >
> > > One thing puzzles me: if you skip first CPU with this 'true', and the
> > > gc->num_msix_usable == num_online_cpus(), it's one more than you can
> > > distribute. What do I miss?
> > >
> >
> > Let me explain this case a bit better then,
> >
> > - num_msix_usable = HWC IRQ + Queue IRQ
> > - nvec in this functions is only Queue IRQ (HWC already setup)
> >
> > When num_online_cpus == num_msix_usable:
> > - nvec = num_online_cpus - 1
> > - first CPU is already assigned to HWC IRQ, so skip it
> > - Queue IRQs fit in the remaining CPUs
> >
> > please let me know if I did not get your question right
>
> Can you put that in a comment?
Sure I will. thanks
>
> > > > + }
> > > > + } else {
> > > > + /*
> > > > + * When num_msix_usable are more than num_online_cpus, we try to
> > > > + * make sure we are using all vcpus. In such a case NUMA or
> > > > + * CPU core affinity does not matter.
> > >
> > > If it doesn't matter, why don't you assign each IRQ to all CPUs then?
> > > In theory, the system would have most of flexibility to balance them.
> > >
> >
> > Okay, let me fix the comment and elaborate on this. It doesn't matter
> > because in such a case we want to anyway exhaust and distribute the
> > Queue IRQs to all vCPUs.
> > We don't want to rely on the system's balancer in this case as it could
> > be skewed by other devices' IRQ weights
>
> I don't understand this. If I want to reserve some CPUs to solely
> handle IRQs from my high-priority hardware, then I configure my system
> accordingly. For example, assign all non-networking IRQs on CPU0, and
> all networking IRQs to all CPUs.
>
> In your case, you distribute IRQs evenly, which means you've no
> preferred CPUs. So, assuming the system is only running your IRQ
> driver, it's at max is as good as all-CPU distribution. In case of
> heavy loading some particular CPU, your scheme could cause
> corresponding IRQs to starve.
>
> I recall, when we was working on irq_setup(), the original idea was to
> distribute IRQs one-to-one, but than I suggested the
>
> irq_set_affinity_and_hint(*irqs++, topology_sibling_cpumask(cpu));
>
> and after experiments, you agreed on that.
>
> Can you please run your throughput test for my suggested distribution
> too? Would be also nice to see how each distribution works when some
> CPUs are under stress.
>
> Thanks,
> Yury
The design of irq_setup() works exactly how we want it for our IRQs for
almost all of our usecases, so we want to keep that as is. The only
scenarios where this is an issue in terms of significant throughput drop
is when we are working with low vCPU VMs (vCPU <= 4 with high TCP
connection counts) and where there are additional NVMe devices attached
to the VM.
The current patch about utilizing all the vCPUs helps in that case and
doesn't cause any regression for other cases.
This linear path is only taken when num_msix_usable > num_online_cpus(),
which is limited to low-vCPU VMs. Larger VMs continue using irq_setup()
as before.
We can definately get our throughput run results on other suggestions
you have. And about that, I just needed a bit more clarity on what to
test against. Are you suggesting, with irq_setup() intact and in use, we
configure the non-mana IRQs to say CPU0 and capture the numbers?
Thanks,
Shradha.
^ permalink raw reply
* Re: [PATCH v3] mshv: Simplify GPA map/unmap hypercall helpers
From: Anirudh Rayabharam @ 2026-05-05 6:13 UTC (permalink / raw)
To: Stanislav Kinsburskii
Cc: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel
In-Reply-To: <177756065245.17889.140699174692055235.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
On Thu, Apr 30, 2026 at 02:52:17PM +0000, Stanislav Kinsburskii wrote:
> Clean up hv_do_map_gpa_hcall() and hv_call_unmap_gpa_pages() after the
> preceding bug-fix patches:
>
> Move "done += completed" before the status checks so that pages mapped
> by a partially-successful batch are included in the error cleanup unmap.
> Previously these mappings were leaked on failure.
>
> While here, improve type safety and readability:
> - Change "int done" to "u64 done" to match the u64 page_count it is
> compared against, avoiding signed/unsigned comparison hazards.
> - Use u64 for loop iteration and batch size variables consistently.
> - Add proper braces to the for-loop body in hv_do_map_gpa_hcall().
> - Remove unnecessary "ret" variable from hv_call_unmap_gpa_pages().
> - Simplify the error-path unmap to use "done << large_shift" directly
> instead of mutating done in place.
>
> v3: aligned changes by 80 colons
> v2: replaced min with min_t
This part describing the changes in various version should be placed
after the "---" line below. This way it won't appear in the final commit
log.
https://www.kernel.org/doc/html/latest/process/submitting-patches.html#commentary
Thanks,
Anirudh.
>
> Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
> Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> ---
> drivers/hv/mshv_root_hv_call.c | 56 +++++++++++++++-------------------------
> 1 file changed, 21 insertions(+), 35 deletions(-)
>
> diff --git a/drivers/hv/mshv_root_hv_call.c b/drivers/hv/mshv_root_hv_call.c
> index e5992c324904a..e1f9e28d5a19b 100644
> --- a/drivers/hv/mshv_root_hv_call.c
> +++ b/drivers/hv/mshv_root_hv_call.c
> @@ -195,8 +195,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> struct hv_input_map_gpa_pages *input_page;
> u64 status, *pfnlist;
> unsigned long irq_flags, large_shift = 0;
> - int ret = 0, done = 0;
> - u64 page_count = page_struct_count;
> + u64 done = 0, page_count = page_struct_count;
> + int ret = 0;
>
> if (page_count == 0 || (pages && mmio_spa))
> return -EINVAL;
> @@ -213,8 +213,8 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> }
>
> while (done < page_count) {
> - ulong i, completed, remain = page_count - done;
> - int rep_count = min(remain, HV_MAP_GPA_BATCH_SIZE);
> + u64 i, completed, remain = page_count - done;
> + u64 rep_count = min_t(u64, remain, HV_MAP_GPA_BATCH_SIZE);
>
> local_irq_save(irq_flags);
> input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> @@ -224,23 +224,14 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> input_page->map_flags = flags;
> pfnlist = input_page->source_gpa_page_list;
>
> - for (i = 0; i < rep_count; i++)
> - if (flags & HV_MAP_GPA_NO_ACCESS) {
> + for (i = 0; i < rep_count; i++) {
> + if (flags & HV_MAP_GPA_NO_ACCESS)
> pfnlist[i] = 0;
> - } else if (pages) {
> - u64 index = (done + i) << large_shift;
> -
> - if (index >= page_struct_count) {
> - ret = -EINVAL;
> - break;
> - }
> - pfnlist[i] = page_to_pfn(pages[index]);
> - } else {
> + else if (pages)
> + pfnlist[i] = page_to_pfn(pages[(done + i) <<
> + large_shift]);
> + else
> pfnlist[i] = mmio_spa + done + i;
> - }
> - if (ret) {
> - local_irq_restore(irq_flags);
> - break;
> }
>
> status = hv_do_rep_hypercall(HVCALL_MAP_GPA_PAGES, rep_count, 0,
> @@ -248,29 +239,26 @@ static int hv_do_map_gpa_hcall(u64 partition_id, u64 gfn, u64 page_struct_count,
> local_irq_restore(irq_flags);
>
> completed = hv_repcomp(status);
> + done += completed;
>
> if (hv_result_needs_memory(status)) {
> ret = hv_call_deposit_pages(NUMA_NO_NODE, partition_id,
> HV_MAP_GPA_DEPOSIT_PAGES);
> if (ret)
> break;
> -
> } else if (!hv_result_success(status)) {
> ret = hv_result_to_errno(status);
> break;
> }
> -
> - done += completed;
> }
>
> if (ret && done) {
> u32 unmap_flags = 0;
>
> - if (flags & HV_MAP_GPA_LARGE_PAGE) {
> + if (flags & HV_MAP_GPA_LARGE_PAGE)
> unmap_flags |= HV_UNMAP_GPA_LARGE_PAGE;
> - done <<= large_shift;
> - }
> - hv_call_unmap_gpa_pages(partition_id, gfn, done, unmap_flags);
> + hv_call_unmap_gpa_pages(partition_id, gfn,
> + done << large_shift, unmap_flags);
> }
>
> return ret;
> @@ -305,7 +293,7 @@ int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
> struct hv_input_unmap_gpa_pages *input_page;
> u64 status, page_count = page_count_4k;
> unsigned long irq_flags, large_shift = 0;
> - int ret = 0, done = 0;
> + u64 done = 0;
>
> if (page_count == 0)
> return -EINVAL;
> @@ -319,8 +307,8 @@ int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
> }
>
> while (done < page_count) {
> - ulong completed, remain = page_count - done;
> - int rep_count = min(remain, HV_UMAP_GPA_PAGES);
> + u64 completed, remain = page_count - done;
> + u64 rep_count = min_t(u64, remain, HV_UMAP_GPA_PAGES);
>
> local_irq_save(irq_flags);
> input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> @@ -333,15 +321,13 @@ int hv_call_unmap_gpa_pages(u64 partition_id, u64 gfn, u64 page_count_4k,
> local_irq_restore(irq_flags);
>
> completed = hv_repcomp(status);
> - if (!hv_result_success(status)) {
> - ret = hv_result_to_errno(status);
> - break;
> - }
> -
> done += completed;
> +
> + if (!hv_result_success(status))
> + return hv_result_to_errno(status);
> }
>
> - return ret;
> + return 0;
> }
>
> int hv_call_get_gpa_access_states(u64 partition_id, u32 count, u64 gpa_base_pfn,
>
>
^ permalink raw reply
* [PATCH v2] mshv: support 1G hugepages by passing them as 2M-aligned chunks
From: Anirudh Rayabharam (Microsoft) @ 2026-05-05 5:58 UTC (permalink / raw)
To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li
Cc: linux-hyperv, linux-kernel, Anirudh Rayabharam (Microsoft)
The hypervisor's map GPA hypercall coalesces contiguous 2M-aligned chunks
into 1G mappings when alignment permits, so the driver can support 1G
hugepages by feeding them in as 2M chunks. Note that this is the only
way to make 1G mappings. There is no way to directly map a 1G hugepage
using the hypercall.
Update mshv_chunk_stride() to:
- Accept 2M-aligned tail pages of a larger folio. The previous
PageHead() check rejected every page after the head of a 1G hugepage
and fell back to 4K mappings for the remaining 1022 MB. Replace it
with a PFN alignment check so any 2M-aligned page of a sufficiently
large folio is acceptable.
- Allow only PMD_ORDER and PUD_ORDER folios. Other compound orders
(e.g. mTHP) remain unsupported and are rejected with -EINVAL plus a
pr_err(); silently mapping them with HV_MAP_GPA_LARGE_PAGE would
fail anyway.
Cap the stride returned by mshv_chunk_stride() against page_count in
mshv_region_process_chunk() to avoid overflow situations. For example,
The natural stride of a 1G folio (1 << PUD_ORDER = 262144 pages) exceeds
the per-fault batch (MSHV_MAP_FAULT_IN_PAGES = 512), which would otherwise
cause an out-of-bounds read past mreg_pages.
Assisted-by: Copilot-CLI:claude-opus-4.7
Signed-off-by: Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
---
Changes in v2:
- Handled the case where we can have 2M aligned pages in the middle of a
1G page
- Brought back the page order check but expanded it to include 1G
- Clamp stride to requested page count in mshv_region_process_chunk
- Link to v1: https://lore.kernel.org/r/20260416-huge_1g-v1-1-e066738cddfb@anirudhrb.com
---
drivers/hv/mshv_regions.c | 17 ++++++++++++-----
1 file changed, 12 insertions(+), 5 deletions(-)
diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index fdffd4f002f6..8a9d7fb558c4 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -29,6 +29,8 @@
* Uses huge page stride if the backing page is huge and the guest mapping
* is properly aligned; otherwise falls back to single page stride.
*
+ * Only PMD_ORDER (2 MB) and PUD_ORDER (1 GB) folios are supported.
+ *
* Return: Stride in pages, or -EINVAL if page order is unsupported.
*/
static int mshv_chunk_stride(struct page *page,
@@ -38,18 +40,21 @@ static int mshv_chunk_stride(struct page *page,
/*
* Use single page stride by default. For huge page stride, the
- * page must be compound and point to the head of the compound
- * page, and both gfn and page_count must be huge-page aligned.
+ * page must be compound, the page's PFN must itself be 2 M-aligned
+ * (so that a 2M-aligned tail page of a larger folio is acceptable),
+ * and both gfn and page_count must be huge-page aligned.
*/
- if (!PageCompound(page) || !PageHead(page) ||
+ if (!PageCompound(page) ||
+ !IS_ALIGNED(page_to_pfn(page), PTRS_PER_PMD) ||
!IS_ALIGNED(gfn, PTRS_PER_PMD) ||
!IS_ALIGNED(page_count, PTRS_PER_PMD))
return 1;
page_order = folio_order(page_folio(page));
- /* The hypervisor only supports 2M huge page */
- if (page_order != PMD_ORDER)
+ if (page_order != PMD_ORDER && page_order != PUD_ORDER) {
+ pr_err("mshv: unsupported folio order %u\n", page_order);
return -EINVAL;
+ }
return 1 << page_order;
}
@@ -96,6 +101,8 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
if (stride < 0)
return stride;
+ stride = min_t(u64, stride, page_count);
+
/* Start at stride since the first stride is validated */
for (count = stride; count < page_count; count += stride) {
page = region->mreg_pages[page_offset + count];
---
base-commit: cd9f2e7d6e5b1837ef40b96e300fa28b73ab5a77
change-id: 20260416-huge_1g-e44461393c8f
Best regards,
--
Anirudh Rayabharam (Microsoft) <anirudh@anirudhrb.com>
^ permalink raw reply related
* Re: [PATCH net-next v2] net: mana: hardening: Reject zero max_num_queues from GDMA_QUERY_MAX_RESOURCES
From: patchwork-bot+netdevbpf @ 2026-05-05 2:30 UTC (permalink / raw)
To: Erni Sri Satya Vennela
Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, horms, shradhagupta, dipayanroy,
yury.norov, linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260430083627.1873757-1-ernis@linux.microsoft.com>
Hello:
This patch was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Thu, 30 Apr 2026 01:36:21 -0700 you wrote:
> In a CVM environment, hardware responses cannot be trusted. The
> GDMA_QUERY_MAX_RESOURCES command returns resource limits used to
> determine the maximum number of queues.
>
> In mana_gd_query_max_resources(), gc->max_num_queues is initialized
> from num_online_cpus() and successively clamped by the hardware-reported
> max_eq, max_cq, max_sq, max_rq, and num_msix_usable values. If any of
> these hardware values is zero, gc->max_num_queues becomes zero and the
> function returns success. This leads to a confusing failure later when
> alloc_etherdev_mq() is called with zero queues, returning NULL and
> producing a misleading -ENOMEM error.
>
> [...]
Here is the summary with links:
- [net-next,v2] net: mana: hardening: Reject zero max_num_queues from GDMA_QUERY_MAX_RESOURCES
https://git.kernel.org/netdev/net-next/c/f7622e58e802
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* Re: [PATCH net-next v2] net: mana: hardening: Reject zero max_num_queues from MANA_QUERY_VPORT_CONFIG
From: patchwork-bot+netdevbpf @ 2026-05-05 2:30 UTC (permalink / raw)
To: Erni Sri Satya Vennela
Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, kuba, pabeni, dipayanroy, shirazsaleem, kees,
linux-hyperv, netdev, linux-kernel
In-Reply-To: <20260430085638.1875400-1-ernis@linux.microsoft.com>
Hello:
This patch was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Thu, 30 Apr 2026 01:56:31 -0700 you wrote:
> As a part of MANA hardening for CVM, validate that max_num_sq and
> max_num_rq returned by MANA_QUERY_VPORT_CONFIG are not zero. These
> values flow into apc->num_queues, which is used as an allocation count
> and loop bound. A zero value would result in zero-size allocations and
> incorrect driver behavior.
>
> Return -EPROTO if either value is zero.
>
> [...]
Here is the summary with links:
- [net-next,v2] net: mana: hardening: Reject zero max_num_queues from MANA_QUERY_VPORT_CONFIG
https://git.kernel.org/netdev/net-next/c/93ca1575dd1f
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply
* [PATCH v2] Drivers: hv: vmbus: Improve the logic of reserving fb_mmio on Gen2 VMs
From: Dexuan Cui @ 2026-05-05 0:48 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli, linux-hyperv, linux-kernel,
mhklinux, matthew.ruffell, johansen, hargar
Cc: stable
If vmbus_reserve_fb() in the kdump/kexec kernel fails to properly reserve
the framebuffer MMIO range (which is below 4GB) due to a Gen2 VM's
screen.lfb_base being zero [1], there is an MMIO conflict between the
drivers hyperv-drm and pci-hyperv: when the driver pci-hyperv's
hv_pci_allocate_bridge_windows() calls vmbus_allocate_mmio() to get a
32-bit MMIO range, it may get an MMIO range that overlaps with the
framebuffer MMIO range, and later hv_pci_enter_d0() fails with an
error message "PCI Pass-through VSP failed D0 Entry with status" since
the host thinks that PCI devices must not use MMIO space that the
host has assigned to the framebuffer.
This is especially an issue if pci-hyperv is built-in and hyperv-drm is
built as a module. Consequently, the kdump/kexec kernel fails to detect
PCI devices via pci-hyperv, and may fail to mount the root file system,
which may reside in a NVMe disk. The issue described here has existed
for SR-IOV VF NICs since day one of the pci-hyperv driver, and has been
worked around on x64 when possible. With the recent introduction of
ARM64 VMs that boot from NVMe, there is no workaround, so we need a
formal fix.
On Gen2 VMs, if the screen.lfb_base is 0 in the kdump/kexec kernel [1],
fall back to the low MMIO base, which should be equal to the framebuffer
MMIO base [2] (the statement is true according to my testing on x64
Windows Server 2016, and on x64 and ARM64 Windows Server 2025 and on
Azure. I checked with the Hyper-V team and they said the statement should
continue to be true for Gen2 VMs). In the first kernel, screen.lfb_base
is not 0; if the user specifies a very high resolution, it's not enough
to only reserve 8MB: in this case, reserve half of the space below 4GB,
but cap the reservation to 128MB, which is the required framebuffer size
of the highest resolution 7680*4320 supported by Hyper-V.
While at it, fix the comparison "end > VTPM_BASE_ADDRESS" by changing
the > to >=. Here the 'end' is an inclusive end (typically, it's
0xFFFF_FFFF for the low MMIO range).
Note: vmbus_reserve_fb() now also reserves an MMIO range at the beginning
of the low MMIO range on CVMs, which have no framebuffers (the
'screen.lfb_base' in vmbus_reserve_fb() is 0 for CVMs), just in case the
host might treat the beginning of the low MMIO range specially [4]. BTW,
the OpenHCL kernel is not affected by the change, because that kernel
boots with DeviceTree rather than ACPI (so vmbus_reserve_fb() won't run
there), and there is no framebuffer device for that kernel.
Note: normally Gen1 VMs don't have the MMIO conflict issue because the
framebuffer MMIO range (which is hardcoded to base=4GB-128MB and
size=64MB for Gen1 VMs by the host) is always reported via the legacy PCI
graphics device's BAR, so the kdump/kexec kernel can reserve the 64MB
MMIO range; however, if the VM is configured to use a very high resolution
and the required framebuffer size exceeds 64MB (AFAIK, in practice, this
isn't a typical configuration by users), the hyperv-drm driver may need to
allocate an MMIO range above 4GB and change the framebuffer MMIO location
to the allocated MMIO range -- in this case, there can still be issues [3]
which can't be easily fixed: any possible affected Gen1 users would have
to use a resolution whose framebuffer size is <= 64MB, or switch to Gen2
VMs.
[1] https://lore.kernel.org/all/SA1PR21MB692176C1BC53BFC9EAE5CF8EBF51A@SA1PR21MB6921.namprd21.prod.outlook.com/
[2] https://lore.kernel.org/all/SA1PR21MB69218F955B62DFF62E3E88D2BF222@SA1PR21MB6921.namprd21.prod.outlook.com/
[3] https://lore.kernel.org/all/SA1PR21MB69213486F821CA5A2C793C81BF342@SA1PR21MB6921.namprd21.prod.outlook.com/
[4] https://lore.kernel.org/all/SN6PR02MB415726B17D5A6027CD1717E8D4342@SN6PR02MB4157.namprd02.prod.outlook.com/
Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
CC: stable@vger.kernel.org
Signed-off-by: Dexuan Cui <decui@microsoft.com>
---
Changes since v1 (https://lore.kernel.org/all/20260416183529.838321-1-decui@microsoft.com/):
Fixed a typo in the subject: s/logc/logic/.
In the commit message, better explained fb_mmio_base is equal to
low_mmio_base for Gen2 VMs.
Addressed Michael Kelley's comments:
In the commit message:
Changed the "kdump" to "kdump/kexec" since the described
issue is applicable to both kdump and kexec.
Provided more detail about the MMIO conflict.
Described an scenario where Gen1 VMs can also be affected.
Added a pr_warn() in vmbus_reserve_fb() in case the 'start' is 0.
Dropped the CVM check in vmbus_reserve(), meaning vmbus_reserve_fb()
also reserves MMIO for CVMs.
Changed "low_mmio_base >= SZ_4G" to "upper_32_bits(low_mmio_base)"
to avoid a compilation warning for the i386 build.
Changed "0x%pa" to "%pa", because %pa already adds a "0x" prefix.
Hi Krister, Matthew, sorry -- I'm not adding your Tested-by's since
the code changed, though the change is small. If the v2 looks good
to Michael, please test the patch again.
Hi Hardik, I'm not adding your Reviewed-by since the patch changed.
Please review the v2.
drivers/hv/vmbus_drv.c | 29 ++++++++++++++++++++++++++---
1 file changed, 26 insertions(+), 3 deletions(-)
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index f0d0803d1e16..d73ac5c8dd04 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -2327,8 +2327,8 @@ static acpi_status vmbus_walk_resources(struct acpi_resource *res, void *ctx)
return AE_NO_MEMORY;
/* If this range overlaps the virtual TPM, truncate it. */
- if (end > VTPM_BASE_ADDRESS && start < VTPM_BASE_ADDRESS)
- end = VTPM_BASE_ADDRESS;
+ if (end >= VTPM_BASE_ADDRESS && start < VTPM_BASE_ADDRESS)
+ end = VTPM_BASE_ADDRESS - 1;
new_res->name = "hyperv mmio";
new_res->flags = IORESOURCE_MEM;
@@ -2395,6 +2395,7 @@ static void vmbus_mmio_remove(void)
static void __maybe_unused vmbus_reserve_fb(void)
{
resource_size_t start = 0, size;
+ resource_size_t low_mmio_base;
struct pci_dev *pdev;
if (efi_enabled(EFI_BOOT)) {
@@ -2402,6 +2403,24 @@ static void __maybe_unused vmbus_reserve_fb(void)
if (IS_ENABLED(CONFIG_SYSFB)) {
start = sysfb_primary_display.screen.lfb_base;
size = max_t(__u32, sysfb_primary_display.screen.lfb_size, 0x800000);
+
+ low_mmio_base = hyperv_mmio->start;
+ if (!low_mmio_base || upper_32_bits(low_mmio_base) ||
+ (start && start < low_mmio_base)) {
+ pr_warn("Unexpected low mmio base %pa\n", &low_mmio_base);
+ } else {
+ /*
+ * If the kdump kernel's lfb_base is 0,
+ * fall back to the low mmio base.
+ */
+ if (!start)
+ start = low_mmio_base;
+ /*
+ * Reserve half of the space below 4GB for high
+ * resolutions, but cap the reservation to 128MB.
+ */
+ size = min((SZ_4G - start) / 2, SZ_128M);
+ }
}
} else {
/* Gen1 VM: get FB base from PCI */
@@ -2422,8 +2441,10 @@ static void __maybe_unused vmbus_reserve_fb(void)
pci_dev_put(pdev);
}
- if (!start)
+ if (!start) {
+ pr_warn("Unexpected framebuffer mmio base of zero\n");
return;
+ }
/*
* Make a claim for the frame buffer in the resource tree under the
@@ -2433,6 +2454,8 @@ static void __maybe_unused vmbus_reserve_fb(void)
*/
for (; !fb_mmio && (size >= 0x100000); size >>= 1)
fb_mmio = __request_region(hyperv_mmio, start, size, fb_mmio_name, 0);
+
+ pr_info("hv_mmio=%pR,%pR fb=%pR\n", hyperv_mmio, hyperv_mmio->sibling, fb_mmio);
}
/**
--
2.34.1
^ permalink raw reply related
* RE: [EXTERNAL] Re: [PATCH net-next v6 2/6] net: mana: Query device capabilities and configure MSI-X sharing for EQs
From: Long Li @ 2026-05-04 22:30 UTC (permalink / raw)
To: Simon Horman
Cc: Konstantin Taranov, kuba@kernel.org, davem@davemloft.net,
pabeni@redhat.com, edumazet@google.com, andrew+netdev@lunn.ch,
jgg@ziepe.ca, leon@kernel.org, Haiyang Zhang, KY Srinivasan,
wei.liu@kernel.org, Dexuan Cui, netdev@vger.kernel.org,
linux-rdma@vger.kernel.org, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org
In-Reply-To: <20260502152649.292433-5-horms@kernel.org>
> From: 'Simon Horman' <horms@kernel.org>
>
> This is an AI-generated review of your patch. The human sending this email has
> considered the AI review valid, or at least plausible.
> Full review at:
> https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fnetdev-
> ai.bots.linux.dev%2Fsashiko%2F&data=05%7C02%7Clongli%40microsoft.com%7
> Ca5d48bdebd0240e5d91308dea85ff3a8%7C72f988bf86f141af91ab2d7cd011db4
> 7%7C1%7C0%7C639133327234427106%7CUnknown%7CTWFpbGZsb3d8eyJFbX
> B0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbC
> IsIldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=4doWStoTQ4w7CbmhN3I294b%
> 2B9c3NNjDGySSmISHGZk4%3D&reserved=0
>
> I apologise for any overlap with this and the review of this patchset available at
> sashiko.dev. They both seem relevant and I'm unsure how to cleanly pass them
> both on.
>
> Both Sashiko instances also flagged concerns in the remaining patches of this
> patchset. But I am not forwarding them as they seem to be lower priority, and I
> think I've already created enough noise here.
>
> ---
> net: mana: Query device capabilities and configure MSI-X sharing for EQs
>
> This patch queries device capabilities early in GDMA probe and chooses between
> dedicated MSI-X per vPort or a shared-MSI-X mode based on the available vector
> count, introducing gc->max_num_queues_vport,
> gc->msi_sharing and gc->msi_bitmap, and renaming mana_query_device_cfg()
> to mana_gd_query_device_cfg() so it can be called from the GDMA layer.
>
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > index 098fbda0d128..b96859e0aec9 100644
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > @@ -149,6 +149,9 @@ static int mana_gd_query_max_resources(struct
> pci_dev *pdev)
> > struct gdma_context *gc = pci_get_drvdata(pdev);
> > struct gdma_query_max_resources_resp resp = {};
> > struct gdma_general_req req = {};
> > + unsigned int max_num_queues;
> > + u8 bm_hostmode;
> > + u16 num_ports;
> > int err;
> >
> > mana_gd_init_req_hdr(&req.hdr, GDMA_QUERY_MAX_RESOURCES,
> @@ -194,6
> > +197,40 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
> > if (gc->max_num_queues > gc->num_msix_usable - 1)
> > gc->max_num_queues = gc->num_msix_usable - 1;
> >
> > + err = mana_gd_query_device_cfg(gc, MANA_MAJOR_VERSION,
> MANA_MINOR_VERSION,
> > + MANA_MICRO_VERSION, &num_ports,
> &bm_hostmode);
> > + if (err)
> > + return err;
>
> Is this the same MANA_QUERY_DEV_CONFIG that mana_probe() issues shortly
> after? mana_probe() still contains:
>
> err = mana_gd_query_device_cfg(gc, MANA_MAJOR_VERSION,
> MANA_MINOR_VERSION,
> MANA_MICRO_VERSION, &num_ports,
> &bm_hostmode);
>
> so the same HWC round-trip runs twice on every probe with identical arguments,
> both populating gc->adapter_mtu.
>
> The commit message frames this rename as consolidation ("used at GDMA device
> probe time"), but the mana_probe() call-site is not removed. Was one of the two
> call-sites meant to go away?
>
> The relocation of debugfs_create_u16("adapter-MTU", ...) from
> mana_gd_query_device_cfg() into mana_probe() only makes sense if the function
> is now invoked more than once, which hints at the same duplication.
The two calls are intentional and serve different layers. The call in mana_gd_query_max_resources() runs at the GDMA layer during device setup - it needs num_ports to compute per-vPort MSI-X
allocation before any MANA-level structures exist. The call in mana_probe() runs at the MANA layer and consumes the results differently: it populates ac->num_ports, ac->bm_hostmode, and on
resume validates that the port count hasn't changed.
The HWC round-trip is lightweight (a single command/response exchange), and the function is renamed to mana_gd_ precisely because it now has callers at both layers. The debugfs move is the same
reason - the debugfs node should only be created once, so it moves to mana_probe() which runs once, rather than staying in the function that is now called from both sites.
>
> > +
> > + if (!num_ports)
> > + return -EINVAL;
> > +
> > + /*
> > + * Adjust gc->max_num_queues returned from the SOC to allow
> dedicated
> > + * MSIx for each vPort. Clamp to no less than
> MANA_DEF_NUM_QUEUES.
> > + */
> > + max_num_queues = (gc->num_msix_usable - 1) / num_ports;
> > + max_num_queues = roundup_pow_of_two(max(max_num_queues, 1U));
> > + if (max_num_queues < MANA_DEF_NUM_QUEUES)
> > + max_num_queues = MANA_DEF_NUM_QUEUES;
> > +
> > + /*
> > + * Use dedicated MSIx for EQs whenever possible, use MSIx sharing for
> > + * Ethernet EQs when (max_num_queues * num_ports >
> num_msix_usable - 1)
> > + */
> > + max_num_queues = min(gc->max_num_queues, max_num_queues);
> > + if (max_num_queues * num_ports > gc->num_msix_usable - 1)
> > + gc->msi_sharing = true;
>
> Is gc->msi_sharing ever reset to false? The only two writers are this line and
> mana_gd_setup_hwc_irqs() (the !pci_msix_can_alloc_dyn branch), and both only
> set it to true. mana_gd_remove_irqs() frees msi_bitmap and zeros
> max_num_msix / num_msix_usable, but does not clear msi_sharing, and the
> gdma_context survives mana_gd_suspend() / mana_gd_resume().
>
> Once true is latched in any setup cycle, mana_gd_setup() will always take the
> sharing branch on resume even if the recomputed max_num_queues * num_ports
> fits within num_msix_usable - 1. Is that consistent with the commit message
> saying sharing "is only enabled when there are not enough MSI-X vectors for
> dedicated allocation"?
msi_sharing is one-directional by design, but this is not a problem in practice. The inputs to the sharing computation - num_msix_usable, num_ports, and hardware queue limits - are properties of
the hardware and do not change across suspend/resume cycles. The gdma_context is kzalloc'd at probe time (so msi_sharing starts as false), and on any given hardware configuration the computation
in mana_gd_query_max_resources() will always produce the same result.
On resume, mana_gd_setup_hwc_irqs() re-evaluates dynamic MSI-X support and mana_gd_query_max_resources() recomputes the sharing check from the same hardware parameters. Since the hardware
doesn't change, the flag converges to the same value. Adding an explicit reset would be defensive but not functionally necessary.
>
> > +
> > + /* If MSI is shared, use max allowed value */
> > + if (gc->msi_sharing)
> > + gc->max_num_queues_vport = min(gc->num_msix_usable - 1,
> gc->max_num_queues);
> > + else
> > + gc->max_num_queues_vport = max_num_queues;
> > +
> > + dev_info(gc->dev, "MSI sharing mode %d max queues %d\n",
> > + gc->msi_sharing, gc->max_num_queues);
>
> Should this print gc->max_num_queues_vport rather than
> gc->max_num_queues? The block immediately above computes
> gc->max_num_queues_vport and leaves gc->max_num_queues unchanged, and
> mana_probe_port() sizes alloc_etherdev_mq() and apc->max_queues from
> gc->max_num_queues_vport:
>
> ndev = alloc_etherdev_mq(sizeof(struct mana_port_context),
> gc->max_num_queues_vport);
> ...
> apc->max_queues = gc->max_num_queues_vport;
> apc->num_queues = min(gc->max_num_queues_vport,
> MANA_DEF_NUM_QUEUES);
>
> so the logged value does not match the per-vPort queue count the driver actually
> exposes.
The printed value is intentional. gc->max_num_queues is the GDMA device-level maximum, which is consumed by both the Ethernet and RDMA drivers - the RDMA driver uses it to size its queue
resources. Printing max_num_queues_vport here would only reflect the Ethernet per-vPort limit and would be misleading in a device-level log message. The per-vPort value is already visible
through ethtool channel reporting.
>
> > +
> > return 0;
> > }
> >
> > @@ -1856,6 +1893,7 @@ static int mana_gd_setup_hwc_irqs(struct pci_dev
> > *pdev)
>
> [ ... ]
>
> > @@ -1968,20 +2008,30 @@ static int mana_gd_setup(struct pci_dev *pdev)
> > if (err)
> > goto destroy_hwc;
> >
> > - err = mana_gd_query_max_resources(pdev);
> > + err = mana_gd_detect_devices(pdev);
> > if (err)
> > goto destroy_hwc;
> >
> > - err = mana_gd_setup_remaining_irqs(pdev);
> > - if (err) {
> > - dev_err(gc->dev, "Failed to setup remaining IRQs: %d", err);
> > - goto destroy_hwc;
> > - }
> > -
> > - err = mana_gd_detect_devices(pdev);
> > + err = mana_gd_query_max_resources(pdev);
> > if (err)
> > goto destroy_hwc;
> >
> > + if (!gc->msi_sharing) {
> > + gc->msi_bitmap = bitmap_zalloc(gc->num_msix_usable,
> GFP_KERNEL);
> > + if (!gc->msi_bitmap) {
> > + err = -ENOMEM;
> > + goto destroy_hwc;
> > + }
> > + /* Set bit for HWC */
> > + set_bit(0, gc->msi_bitmap);
> > + } else {
> > + err = mana_gd_setup_remaining_irqs(pdev);
> > + if (err) {
> > + dev_err(gc->dev, "Failed to setup remaining IRQs: %d",
> err);
> > + goto destroy_hwc;
> > + }
> > + }
> > +
>
> Can the driver bring up any vPort after this change when the !msi_sharing branch
> is taken?
>
> In the dedicated branch, only gc->msi_bitmap is allocated and bit 0 is set for
> HWC. mana_gd_setup_remaining_irqs() is skipped, so no gdma_irq_context is
> inserted into gc->irq_contexts for indices 1..
> num_msix_usable-1.
>
> Later, mana_create_eq() still assigns
>
> spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
>
> and mana_gd_register_irq() does:
>
> gic = xa_load(&gc->irq_contexts, msi_index);
> if (WARN_ON(!gic))
> return -EINVAL;
>
> On a typical cloud SKU with, for example, num_msix_usable=32,
> num_ports=1 and num_online_cpus=16, the new math keeps msi_sharing=false
> (16 * 1 <= 31), so every EQ-create goes down this path and hits the WARN_ON.
> Doesn't that make every vPort open and every resume fail for the common
> dedicated-MSI-X case?
>
> The msi_bitmap allocated here is not consumed anywhere in this commit; the on-
> demand allocation via mana_gd_get_gic() appears in the later commit "net:
> mana: Allocate interrupt context for each EQ when creating vPort"
> (dbbdf40a8974). Should the bitmap and the new branch be introduced in the
> same commit that actually uses them, so each commit in the series is
> independently bootable?
You're right that the non-sharing EQ creation path is not fully functional until patch 5 wires mana_gd_get_gic() into mana_create_eq(). However, this is a new capability being built
incrementally: patch 2 introduces the decision framework and bitmap, patch 3 adds the GIC infrastructure, patch 4 converts global IRQ setup to use it, and patch 5 integrates it into per-vPort EQ
creation.
The intermediate state between patches 2 and 5 results in a clean error (-EINVAL from mana_gd_register_irq) - not a crash or data corruption. The dedicated MSI-X mode is a new feature that did
not exist before this series, so there is no regression from the pre-patch baseline. Restructuring to make it functional at each intermediate commit would require squashing the GIC
infrastructure into this patch, producing a significantly larger and harder-to-review change. I'd prefer to keep the logical separation as-is.
^ permalink raw reply
* RE: [EXTERNAL] Re: [PATCH net-next v6 2/6] net: mana: Query device capabilities and configure MSI-X sharing for EQs
From: Long Li @ 2026-05-04 22:21 UTC (permalink / raw)
To: Simon Horman
Cc: Konstantin Taranov, kuba@kernel.org, davem@davemloft.net,
pabeni@redhat.com, edumazet@google.com, andrew+netdev@lunn.ch,
jgg@ziepe.ca, leon@kernel.org, Haiyang Zhang, KY Srinivasan,
wei.liu@kernel.org, Dexuan Cui, netdev@vger.kernel.org,
linux-rdma@vger.kernel.org, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org
In-Reply-To: <20260502150835.281887-1-horms@kernel.org>
> -----Original Message-----
> From: Simon Horman <horms@kernel.org>
> Sent: Saturday, May 2, 2026 8:09 AM
> To: Long Li <longli@microsoft.com>
> Cc: 'Simon Horman' <horms@kernel.org>; Konstantin Taranov
> <kotaranov@microsoft.com>; kuba@kernel.org; davem@davemloft.net;
> pabeni@redhat.com; edumazet@google.com; andrew+netdev@lunn.ch;
> jgg@ziepe.ca; leon@kernel.org; Haiyang Zhang <haiyangz@microsoft.com>; KY
> Srinivasan <kys@microsoft.com>; wei.liu@kernel.org; Dexuan Cui
> <DECUI@microsoft.com>; netdev@vger.kernel.org; linux-rdma@vger.kernel.org;
> linux-hyperv@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [EXTERNAL] Re: [PATCH net-next v6 2/6] net: mana: Query device
> capabilities and configure MSI-X sharing for EQs
>
> From: 'Simon Horman' <horms@kernel.org>
>
> This is an AI-generated review of your patch. The human sending this email has
> considered the AI review valid, or at least plausible.
> Full review at:
> https://sashiko.d/
> ev%2F&data=05%7C02%7Clongli%40microsoft.com%7C1b8423bb5d15490288be
> 08dea85cb944%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C6391333
> 13365211100%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYi
> OiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0
> %7C%7C%7C&sdata=H6rPcEM62LmbWsXdpSVcVaaec3Wx6KJnTnYS0fBk6vY%3D
> &reserved=0
> ---
> net: mana: Query device capabilities and configure MSI-X sharing for EQs
>
> When querying the device, this adjusts the max number of queues to allow
> dedicated MSI-X vectors for each vPort, clamping to no less than
> MANA_DEF_NUM_QUEUES. MSI-X sharing among vPorts is disabled by default
> and enabled only when there are not enough vectors. It also renames
> mana_query_device_cfg() to mana_gd_query_device_cfg().
>
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
>
> [ ... ]
>
> > + /*
> > + * Adjust gc->max_num_queues returned from the SOC to allow
> dedicated
> > + * MSIx for each vPort. Clamp to no less than
> MANA_DEF_NUM_QUEUES.
> > + */
> > + max_num_queues = (gc->num_msix_usable - 1) / num_ports;
> > + max_num_queues = roundup_pow_of_two(max(max_num_queues, 1U));
>
> Does rounding up here inflate max_num_queues and unnecessarily force MSI-X
> sharing?
>
> For example, if there are 63 usable MSIs and 2 ports, the division yields 31.
> Rounding up gives 32. The subsequent check below:
>
> > + /*
> > + * Use dedicated MSIx for EQs whenever possible, use MSIx sharing for
> > + * Ethernet EQs when (max_num_queues * num_ports >
> num_msix_usable - 1)
> > + */
> > + max_num_queues = min(gc->max_num_queues, max_num_queues);
> > + if (max_num_queues * num_ports > gc->num_msix_usable - 1)
> > + gc->msi_sharing = true;
>
> would then evaluate to true (32 * 2 > 63) and force the driver into MSI-X shared
> mode. This seems to contradict the intent to use dedicated MSI-X whenever
> possible.
>
> Would it be better to use rounddown_pow_of_two() instead to ensure the
> calculated queues fit within the available dedicated vectors?
The roundup is intentional. The primary goal is to maximize the number of queues per vPort for throughput - power-of-2 queue counts align with RSS indirection table sizes and give the best
distribution. Dedicated MSI-X is preferred when it comes for free, but not at the cost of halving the queue count.
In your example (63 MSIs, 2 ports), rounddown would give 16 queues per port - nearly half the 31 that the hardware could support. Roundup gives 32 and triggers sharing mode, but sharing mode is
designed to work correctly and the queues still get serviced. The driver trades per-EQ interrupt exclusivity for better parallelism, which is the right tradeoff for network throughput on these
devices.
The sharing check exists as a fallback for exactly this situation - it is not a failure mode but a deliberate operating point.
>
> [ ... ]
>
> > @@ -1968,20 +2008,30 @@ static int mana_gd_setup(struct pci_dev *pdev)
> > if (err)
> > goto destroy_hwc;
> >
> > - err = mana_gd_query_max_resources(pdev);
> > + err = mana_gd_detect_devices(pdev);
> > if (err)
> > goto destroy_hwc;
> >
> > - err = mana_gd_setup_remaining_irqs(pdev);
> > - if (err) {
> > - dev_err(gc->dev, "Failed to setup remaining IRQs: %d", err);
> > - goto destroy_hwc;
> > - }
> > -
> > - err = mana_gd_detect_devices(pdev);
> > + err = mana_gd_query_max_resources(pdev);
> > if (err)
> > goto destroy_hwc;
> >
> > + if (!gc->msi_sharing) {
> > + gc->msi_bitmap = bitmap_zalloc(gc->num_msix_usable,
> GFP_KERNEL);
> > + if (!gc->msi_bitmap) {
> > + err = -ENOMEM;
> > + goto destroy_hwc;
> > + }
> > + /* Set bit for HWC */
> > + set_bit(0, gc->msi_bitmap);
> > + } else {
> > + err = mana_gd_setup_remaining_irqs(pdev);
> > + if (err) {
> > + dev_err(gc->dev, "Failed to setup remaining IRQs: %d",
> err);
> > + goto destroy_hwc;
> > + }
> > + }
>
> If msi_sharing is disabled, we allocate the msi_bitmap but skip calling
> mana_gd_setup_remaining_irqs().
>
> Since mana_gd_setup_hwc_irqs() only allocates a single vector for the hardware
> channel when dynamic allocation is supported, does this leave the device without
> interrupts for its Ethernet queues?
>
> If so, it seems this could lead to queue creation failures when the driver attempts
> to map uninitialized vectors. I notice this is fixed in a later patch in the series
> ("net: mana: Allocate interrupt context for each EQ when creating vPort"), but
> does leaving it out here break bisectability?
You're right that with pci_msix_can_alloc_dyn() returning true on MANA, the non-sharing path is reachable between patches 2 and 5. In that window, mana_gd_register_irq() would fail with -EINVAL
because irq_contexts for indices 1+ haven't been populated yet.
However, the failure is contained: mana_create_eq() returns an error, mana_alloc_queues() propagates it, and the interface fails to come up cleanly - no WARN, no crash, no data corruption. The
driver remains in a consistent state and succeeds once the full series is applied.
This is a new capability being built up across the series. The dedicated MSI-X mode did not exist before, so there is no regression from the pre-patch baseline - the pre-patch code always went
through mana_gd_setup_remaining_irqs() and operated in what is now called sharing mode. Restructuring the series to make non-sharing mode functional at each intermediate commit would require
squashing the GIC infrastructure (patches 3-4) into this patch, producing a single large change that is significantly harder to review.
I'd prefer to keep the logical separation as-is. If you feel strongly about strict bisectability, I could add a fallback in this patch that forces msi_sharing = true when the GIC allocator is
not yet available, and have patch 5 remove it - but that adds throwaway code to an intermediate commit.
^ permalink raw reply
* RE: [EXTERNAL] Re: [PATCH net-next v6 1/6] net: mana: Create separate EQs for each vPort
From: Long Li @ 2026-05-04 22:08 UTC (permalink / raw)
To: Simon Horman
Cc: Konstantin Taranov, kuba@kernel.org, davem@davemloft.net,
pabeni@redhat.com, edumazet@google.com, andrew+netdev@lunn.ch,
jgg@ziepe.ca, leon@kernel.org, Haiyang Zhang, KY Srinivasan,
wei.liu@kernel.org, Dexuan Cui, netdev@vger.kernel.org,
linux-rdma@vger.kernel.org, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org
In-Reply-To: <20260502152929.GL15617@horms.kernel.org>
> On Sat, May 02, 2026 at 04:23:55PM +0100, Simon Horman wrote:
> > From: 'Simon Horman' <horms@kernel.org>
> >
> > This is an AI-generated review of your patch. The human sending this
> > email has considered the AI review valid, or at least plausible.
> > Full review at:
> > https://sash/
> >
> iko.dev%2F&data=05%7C02%7Clongli%40microsoft.com%7C50f9138d30ca49fb0
> 5b
> >
> 708dea85f9e35%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C639133
> 32578
> >
> 9863881%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYiOiIw
> LjAuMD
> >
> AwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0%7C%7C%7
> C&sd
> > ata=1Ew7dYw%2B7zQjROcj63hnOYFUfak20Pi3ytzOf2J0JWg%3D&reserved=0
>
> Sorry about this, there was supposed to be some different text here.
I have replied to both comments.
Thank you,
Long
>
> This review is available at
> https://netdev-/
> ai.bots.linux.dev%2Fsashiko%2F&data=05%7C02%7Clongli%40microsoft.com%7
> C50f9138d30ca49fb05b708dea85f9e35%7C72f988bf86f141af91ab2d7cd011db47
> %7C1%7C0%7C639133325789877695%7CUnknown%7CTWFpbGZsb3d8eyJFbXB
> 0eU1hcGkiOnRydWUsIlYiOiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIs
> IldUIjoyfQ%3D%3D%7C0%7C%7C%7C&sdata=F4%2FR6eEmKkHI0%2FkuGfvuaM
> 42oss8KCUb9J5Bw6B682Y%3D&reserved=0
> And I apologise that it overlaps with the review from
> https://sashiko.d/
> ev%2F&data=05%7C02%7Clongli%40microsoft.com%7C50f9138d30ca49fb05b70
> 8dea85f9e35%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C63913332
> 5789886888%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYi
> OiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0
> %7C%7C%7C&sdata=g%2BFgPtQXNQSqC%2ByHDua2twXj%2BufRZ8yCze757NpY
> vU8%3D&reserved=0
> which I also posted.
^ permalink raw reply
* RE: [EXTERNAL] Re: [PATCH net-next v6 1/6] net: mana: Create separate EQs for each vPort
From: Long Li @ 2026-05-04 22:07 UTC (permalink / raw)
To: Simon Horman
Cc: Konstantin Taranov, kuba@kernel.org, davem@davemloft.net,
pabeni@redhat.com, edumazet@google.com, andrew+netdev@lunn.ch,
jgg@ziepe.ca, leon@kernel.org, Haiyang Zhang, KY Srinivasan,
wei.liu@kernel.org, Dexuan Cui, netdev@vger.kernel.org,
linux-rdma@vger.kernel.org, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org
In-Reply-To: <20260502152354.289044-2-horms@kernel.org>
> From: 'Simon Horman' <horms@kernel.org>
>
> This is an AI-generated review of your patch. The human sending this email has
> considered the AI review valid, or at least plausible.
> Full review at:
> https://sashiko.d/
> ev%2F&data=05%7C02%7Clongli%40microsoft.com%7C36b10107f9774abb6894
> 08dea85f3721%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C6391333
> 24077887564%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYi
> OiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0
> %7C%7C%7C&sdata=cS3FmP7jK6ftjo74XWvtBHNXYa7YKlthqLgzEudcmb4%3D&re
> served=0
> ---
> net: mana: Create separate EQs for each vPort
>
> This change moves EQ ownership from struct mana_context to struct
> mana_port_context so each vPort can have dedicated EQs, and exports
> mana_create_eq()/mana_destroy_eq() for use by the MANA RDMA driver.
>
> > diff --git a/drivers/infiniband/hw/mana/main.c
> > b/drivers/infiniband/hw/mana/main.c
> > index ac5e75dd3494..60cc02e4ad10 100644
> > --- a/drivers/infiniband/hw/mana/main.c
> > +++ b/drivers/infiniband/hw/mana/main.c
> > @@ -20,8 +20,10 @@ void mana_ib_uncfg_vport(struct mana_ib_dev *dev,
> struct mana_ib_pd *pd,
> > pd->vport_use_count--;
> > WARN_ON(pd->vport_use_count < 0);
> >
> > - if (!pd->vport_use_count)
> > + if (!pd->vport_use_count) {
> > + mana_destroy_eq(mpc);
> > mana_uncfg_vport(mpc);
> > + }
> >
> > mutex_unlock(&pd->vport_mutex);
> > }
> > @@ -55,15 +57,21 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32
> port, struct mana_ib_pd *pd,
> > return err;
> > }
> >
> > - mutex_unlock(&pd->vport_mutex);
> >
> > pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
> > pd->tx_vp_offset = mpc->tx_vp_offset;
> > + err = mana_create_eq(mpc);
> > + if (err) {
> > + mana_uncfg_vport(mpc);
> > + pd->vport_use_count--;
> > + }
> > +
> > + mutex_unlock(&pd->vport_mutex);
> >
> > ibdev_dbg(&dev->ib_dev, "vport handle %llx pdid %x doorbell_id %x\n",
> > mpc->port_handle, pd->pdn, doorbell_id);
>
> On the failure path, mana_uncfg_vport() is called and pd->vport_use_count is
> decremented, but execution falls through to the ibdev_dbg() line that prints the
> vport handle as if it were a success. Should this debug print be skipped on the
> error path so logs are not ambiguous?
This is fair point. The ibdev_dbg could be guarded with if (!err) for clarity. That said, ibdev_dbg is compiled out in non-debug builds and the error itself is returned to the caller, so this is
cosmetic. Happy to add the guard if you feel strongly.
>
> Also, pd->tx_shortform_allowed and pd->tx_vp_offset are assigned from mpc
> before the mana_create_eq() attempt and are not cleared when that call fails.
> Callers are expected to gate on pd->vport_use_count, but is it worth resetting
> these fields on failure so the state matches the pre-patch behavior?
These fields are only consumed when pd->vport_use_count > 0. On failure the count is decremented back to 0, so the stale values are never read. The next successful mana_ib_cfg_vport() call will
overwrite them. This matches pre-patch behavior - pre-patch also assigned these fields unconditionally and relied on the use count to gate access.
>
> The mutex_unlock(&pd->vport_mutex) has moved from before the field
> assignments to after mana_create_eq(). That now holds pd->vport_mutex across
> GFP_KERNEL allocations, debugfs directory creation, and multiple
> mana_gd_create_mana_eq() firmware RPCs, and establishes the lock order
> pd->vport_mutex -> apc->vport_mutex via mana_uncfg_vport() on the error
> path. Is the extended critical section intentional?
Yes, this is intentional. Pre-patch, pd->vport_mutex was released before assigning pd->tx_shortform_allowed and pd->tx_vp_offset, which was a data race - a concurrent raw
QP creation on the same PD could read partially-initialized state. The extended critical section ensures the vport is fully configured (including EQs) before any concurrent user can observe
vport_use_count > 0.
>
> >
> > - return 0;
> > + return err;
> > }
> >
> > int mana_ib_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata) diff
> > --git a/drivers/infiniband/hw/mana/qp.c
> > b/drivers/infiniband/hw/mana/qp.c index 645581359cee..6f1043383e8c
> > 100644
> > --- a/drivers/infiniband/hw/mana/qp.c
> > +++ b/drivers/infiniband/hw/mana/qp.c
> > @@ -168,7 +168,15 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp,
> struct ib_pd *pd,
> > cq_spec.gdma_region = cq->queue.gdma_region;
> > cq_spec.queue_size = cq->cqe * COMP_ENTRY_SIZE;
> > cq_spec.modr_ctx_id = 0;
> > - eq = &mpc->ac->eqs[cq->comp_vector];
> > + /* EQs are created when a raw QP configures the vport.
> > + * A raw QP must be created before creating rwq_ind_tbl.
> > + */
> > + if (!mpc->eqs) {
> > + ret = -EINVAL;
> > + i--;
> > + goto fail;
> > + }
> > + eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];
>
> Can the mpc->eqs read race with a concurrent free? mana_ib_create_qp_rss()
> runs without pd->vport_mutex or RTNL, but mpc->eqs is freed by
> mana_destroy_eq() from two paths:
>
> mana_ib_uncfg_vport() (under pd->vport_mutex, on last raw-QP destroy)
> mana_dealloc_queues() (under RTNL, on netdev down)
>
> both of which do:
>
> kfree(apc->eqs);
> apc->eqs = NULL;
>
> with no RCU grace period or reader-visible synchronization. If CPU-A passes
> the !mpc->eqs check after CPU-B begins ip link set dev X down, does CPU-A then
> dereference freed memory via mpc->eqs[...].eq->id?
These two paths cannot run concurrently with RDMA QP creation on the same port. mana_cfg_vport() enforces mutual exclusion between Ethernet and RDMA via apc->vport_use_count:
mutex_lock(&apc->vport_mutex);
if (apc->vport_use_count > 0) {
mutex_unlock(&apc->vport_mutex);
return -EBUSY;
}
apc->vport_use_count++;
If RDMA holds the vport (created a raw QP), Ethernet cannot bring the interface up, so mana_dealloc_queues() cannot run. If Ethernet holds the vport, mana_ib_cfg_vport() → mana_cfg_vport()
returns -EBUSY and no RDMA raw QP is created, so mpc->eqs belongs exclusively to Ethernet.
The mana_ib_uncfg_vport() path requires destroying the last raw QP on the PD, which means no new RDMA QPs should be in flight on that PD. The IB core serializes QP creation/destruction on the
same device context.
>
> Separately, what populates mpc->eqs for an RDMA-only RSS QP user on a probed-
> but-idle Ethernet port? Pre-patch mana_probe() called
> mana_create_eq(ac) unconditionally, so ac->eqs existed for the device lifetime.
> After this patch the only creators are mana_alloc_queues() (Ethernet up) and
> mana_ib_cfg_vport() (raw QP). An RDMA application that uses only RSS QPs and
> never creates a raw QP will now get -EINVAL here where it used to succeed. Is
> this intended, and should the commit log mention it?
This is intentional. An RSS-only RDMA application (no raw QP) could never have worked in practice: mana_ib_create_qp_rss() calls mana_create_wq_obj(mpc, mpc->port_handle, ...) which requires the
vport to be configured. Without a raw QP calling mana_ib_cfg_vport() → mana_cfg_vport(), mpc->port_handle is INVALID_MANA_HANDLE and the firmware call would fail. The -EINVAL is a cleaner early
error for a path that was already broken.
>
> The adjacent comment:
>
> /* EQs are created when a raw QP configures the vport.
> * A raw QP must be created before creating rwq_ind_tbl.
> */
>
> omits the Ethernet-up path that also populates mpc->eqs. Would it be clearer to
> describe both creators?
The comment is in the RDMA code path. Due to the mana_cfg_vport() mutual exclusion described above, when RDMA is executing this code, it owns the vport - so the EQs were created by the raw QP,
not by Ethernet. The Ethernet path is not reachable when RDMA holds the port. The comment is accurate for the context it appears in.
>
> There is also a behavior change in the index expression:
>
> eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];
>
> Pre-patch this was ac->eqs[cq->comp_vector] sized by gc->max_num_queues.
> Now comp_vector is folded modulo mpc->num_queues, which is tunable via
> ethtool -L. Userspace that used distinct comp_vector values to hit distinct EQs
> will silently alias when comp_vector >= num_queues. Should this be documented
> or rejected with -EINVAL rather than silently wrapped?
The modulo wrap is the correct behavior. The EQ array is now sized to mpc->num_queues (the actual configured queue count), not the hardware maximum. Rejecting with -EINVAL would break
applications that set comp_vector based on the device's total num_comp_vectors - which is the standard RDMA practice. Wrapping is what other drivers do in this situation (e.g., mlx5). Pre-patch
the wrap just happened to be unnecessary because the array was oversized to the hardware max.
>
> Can mpc->num_queues be 0 at this point? mana_set_channels() does not reject
> new_count==0, and kzalloc_objs(struct mana_eq, 0) returns ZERO_SIZE_PTR,
> which passes the !mpc->eqs check. During the window between
> mana_create_eq() and the subsequent netif_set_real_num_tx_queues() failing, a
> concurrent RDMA QP create would compute
> cq->comp_vector % 0 here. Should mpc->num_queues be validated alongside
> mpc->eqs?
num_queues cannot be 0. The ethtool core validates this before calling the driver - ethnl_set_channels() rejects the request if combined_count is 0 and there are no separate rx/tx channels
(which MANA doesn't use). Additionally, num_queues is initialized at probe time to min(gc->max_num_queues_vport, MANA_DEF_NUM_QUEUES) where MANA_DEF_NUM_QUEUES is 16 and the hardware always
reports at least 1 queue.
>
> The placement of the !mpc->eqs check is inside the per-iteration loop over
> ind_tbl_size, but mpc->eqs cannot change across iterations, so the check is only
> meaningful at i==0. It works today because i-- then makes i = -1 and the cleanup
> while (i-- > 0) skips. Would hoisting the check above the loop be clearer and less
> fragile against refactoring?
>
> > cq_spec.attached_eq = eq->eq->id;
> >
> > ret = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_RQ,
> @@ -317,7
> > +325,11 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd
> *ibpd,
> > cq_spec.queue_size = send_cq->cqe * COMP_ENTRY_SIZE;
> > cq_spec.modr_ctx_id = 0;
> > eq_vec = send_cq->comp_vector;
> > - eq = &mpc->ac->eqs[eq_vec];
> > + if (!mpc->eqs) {
> > + err = -EINVAL;
> > + goto err_destroy_queue;
> > + }
> > + eq = &mpc->eqs[eq_vec % mpc->num_queues];
>
> The same mpc->eqs TOCTOU concern applies here: the check and subsequent
> dereference occur without pd->vport_mutex or RTNL held, so a concurrent
> mana_dealloc_queues() on the Ethernet side can free mpc->eqs between the
> check and the index access. Is there synchronization that prevents this?
Same answer as above: mana_cfg_vport() mutual exclusion prevents this. In mana_ib_create_qp_raw(), mana_ib_cfg_vport() is called just before this code, which calls mana_cfg_vport(). If Ethernet
holds the port, that call returns -EBUSY and we never reach the mpc->eqs access. If we reach it, RDMA owns the vport and Ethernet cannot tear it down.
>
> The same comp_vector % num_queues aliasing and num_queues==0 divide
> concerns apply here as well.
>
> > cq_spec.attached_eq = eq->eq->id;
> >
> > err = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_SQ,
> &wq_spec,
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index a654b3699c4c..6c709f8b875d 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
>
> [ ... ]
>
> > -static int mana_create_eq(struct mana_context *ac)
> > +int mana_create_eq(struct mana_port_context *apc)
> > {
> > - struct gdma_dev *gd = ac->gdma_dev;
> > + struct gdma_dev *gd = apc->ac->gdma_dev;
> > struct gdma_context *gc = gd->gdma_context;
> > struct gdma_queue_spec spec = {};
> > int err;
> > int i;
> >
> > - ac->eqs = kzalloc_objs(struct mana_eq, gc->max_num_queues);
> > - if (!ac->eqs)
> > + WARN_ON(apc->eqs);
> > + apc->eqs = kzalloc_objs(struct mana_eq, apc->num_queues);
> > + if (!apc->eqs)
> > return -ENOMEM;
>
> WARN_ON(apc->eqs) is used as a defensive assertion but execution proceeds to
> overwrite apc->eqs unconditionally. If the invariant is violated, the previous array
> and all HW EQs it referenced are leaked.
> Should this instead bail out with an error or call mana_destroy_eq() first to keep
> the function idempotent against future callers?
The WARN_ON is a development assertion to catch double-init bugs. If it fires, there is already a logic error in the caller. Silently cleaning up and proceeding would mask the root cause. The
standard kernel pattern for "this must never happen" is WARN_ON + continue, not silent recovery. If we wanted to be stricter, returning -EBUSY would be reasonable, but the current approach is
consistent with kernel convention for invariant checks.
>
> [ ... ]
>
> > @@ -3326,6 +3349,9 @@ static int mana_dealloc_queues(struct net_device
> *ndev)
> > mana_fence_rqs(apc);
> >
> > /* Even in err case, still need to cleanup the vPort */
> > + mana_destroy_rxqs(apc);
> > + mana_destroy_txq(apc);
> > + mana_destroy_eq(apc);
> > mana_destroy_vport(apc);
>
> What is the intended interaction with outstanding RDMA QPs here? An RDMA
> user that created an RSS QP while the Ethernet netdev was up sees
> mpc->eqs populated by mana_alloc_queues(). When the admin issues
> ip link set dev N down, mana_dealloc_queues() -> mana_destroy_eq() destroys
> those EQs and frees mpc->eqs while the RDMA QPs are still alive, leaving the QPs
> with dangling attached_eq IDs at the hardware level and stale kernel references.
This scenario cannot occur. As described above, mana_cfg_vport() enforces mutual exclusion - Ethernet and RDMA cannot hold the same port simultaneously. If Ethernet is up, RDMA raw QP creation
on that port returns -EBUSY. If RDMA holds the port, mana_alloc_queues() → mana_create_vport() → mana_cfg_vport() returns -EBUSY and the interface fails to come up. There is no state where both
have created EQs on the same mpc.
>
> Pre-patch ac->eqs lived for the full mana_context lifetime and was torn down
> only in mana_remove(). Is unconditionally destroying the EQs on netdev-down
> the intended new behavior, and if so how are concurrent RDMA consumers
> expected to recover?
Yes, destroying EQs on netdev-down is the intended new behavior, and no RDMA recovery path is needed because the scenario has no concurrent RDMA consumers.
mana_cfg_vport() enforces mutual exclusion at the hardware port level via apc->vport_use_count - it returns -EBUSY if the port is already held. Ethernet and RDMA cannot hold the same port
simultaneously:
- If Ethernet is up → RDMA raw QP creation fails at mana_cfg_vport() with -EBUSY → no RDMA EQs or QPs exist on that port → netdev-down destroys only Ethernet's own EQs.
- If RDMA holds the port → mana_alloc_queues() → mana_create_vport() → mana_cfg_vport() returns -EBUSY → interface never comes up → mana_dealloc_queues() never runs.
Pre-patch, the device-lifetime EQs in ac->eqs were shared across all ports and both subsystems, which masked this exclusivity - the EQs were always present regardless of who owned the port.
Post-patch, each port owns its EQs, and they follow the lifecycle of whoever holds the port. The exclusion guarantee means there is nothing to recover from.
^ permalink raw reply
* RE: [EXTERNAL] Re: [PATCH net-next v6 1/6] net: mana: Create separate EQs for each vPort
From: Long Li @ 2026-05-04 21:51 UTC (permalink / raw)
To: Simon Horman
Cc: Konstantin Taranov, kuba@kernel.org, davem@davemloft.net,
pabeni@redhat.com, edumazet@google.com, andrew+netdev@lunn.ch,
jgg@ziepe.ca, leon@kernel.org, Haiyang Zhang, KY Srinivasan,
wei.liu@kernel.org, Dexuan Cui, netdev@vger.kernel.org,
linux-rdma@vger.kernel.org, linux-hyperv@vger.kernel.org,
linux-kernel@vger.kernel.org
In-Reply-To: <20260502150717.281387-1-horms@kernel.org>
> From: 'Simon Horman' <horms@kernel.org>
>
> This is an AI-generated review of your patch. The human sending this email has
> considered the AI review valid, or at least plausible.
> Full review at:
> https://sashiko.d/
> ev%2F&data=05%7C02%7Clongli%40microsoft.com%7C3344dd1ec8e44fc6d4a50
> 8dea85caa2a%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C63913331
> 3118303655%7CUnknown%7CTWFpbGZsb3d8eyJFbXB0eU1hcGkiOnRydWUsIlYi
> OiIwLjAuMDAwMCIsIlAiOiJXaW4zMiIsIkFOIjoiTWFpbCIsIldUIjoyfQ%3D%3D%7C0
> %7C%7C%7C&sdata=GJTF8DNw%2FzYNsPz0lC%2FDFZpcCwqjuCmfVu7smim81x
> E%3D&reserved=0
> ---
> net: mana: Create separate EQs for each vPort
>
> This commit modifies the mana driver to prepare for assigning vPorts to dedicated
> MSI-X vectors by removing EQ sharing among vPorts. It moves the EQ definition
> from the mana_context to the mana_port_context and exports EQ lifecycle
> functions for use by the MANA RDMA driver.
>
> > diff --git a/drivers/infiniband/hw/mana/qp.c
> > b/drivers/infiniband/hw/mana/qp.c index 645581359cee..6f1043383e8c
> > 100644
> > --- a/drivers/infiniband/hw/mana/qp.c
> > +++ b/drivers/infiniband/hw/mana/qp.c
> > @@ -168,7 +168,15 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp,
> struct ib_pd *pd,
> > cq_spec.gdma_region = cq->queue.gdma_region;
> > cq_spec.queue_size = cq->cqe * COMP_ENTRY_SIZE;
> > cq_spec.modr_ctx_id = 0;
> > - eq = &mpc->ac->eqs[cq->comp_vector];
> > + /* EQs are created when a raw QP configures the vport.
> > + * A raw QP must be created before creating rwq_ind_tbl.
> > + */
> > + if (!mpc->eqs) {
> > + ret = -EINVAL;
> > + i--;
> > + goto fail;
> > + }
> > + eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];
>
> Is it possible for the EQs to be destroyed while this RSS QP is still actively using
> them?
No. The EQ id is consumed only at creation time - it is passed to firmware as cq_spec.attached_eq during mana_create_wq_obj(). After that call the CQ-to-EQ binding lives entirely in firmware.
The kernel never dereferences mpc->eqs again for that RSS QP's lifetime, so there is no ongoing kernel-side access to the EQ struct from an active RSS QP.
>
> If the EQs are created by the Ethernet interface being brought up, or by a RAW
> QP configuring the vport, this RSS QP will attach to them without incrementing
> pd->vport_use_count or taking any vport reference count.
This is by design. The RSS QP (RX side) does not take a vport reference, and symmetrically mana_ib_destroy_qp_rss() does not release one either. Only raw QPs (SQ side) take and release vport
references. The refcount is balanced: RSS QPs are pure consumers of an already-configured vport.
>
> If the Ethernet interface is subsequently brought down, or the RAW QP is
> destroyed, mana_destroy_eq() will be called, freeing the mpc->eqs array and
> destroying the underlying DMA regions while this RSS QP remains active. This
> regression could allow the hardware to DMA completion events into freed EQ
> memory.
Destroying the raw QP also calls mana_uncfg_vport(), which deconfigures the vport entirely. After that, firmware will not route any traffic or generate completions on this vport, so there are no
in-flight DMA writes to the EQ. This is the same pre-existing behavior: the raw QP has always been the vport lifecycle anchor, and destroying it while an RSS QP is active would have broken the
vport regardless - this patch does not change that relationship. Before this patch the EQs simply outlived the vport (device lifetime vs vport lifetime), which masked the dependency but did not
make the out-of-order teardown any safer at the vport level.
>
> Additionally, since mpc->eqs is accessed here without holding pd->vport_mutex,
> could a concurrent teardown of the EQs lead to a use-after-free when reading
> eq->eq->id?
The RDMA core serializes QP creation and destruction on the same device context. A concurrent teardown would require destroying the raw QP on the same PD simultaneously with RSS QP creation,
which the IB core does not permit. The !mpc->eqs check is a defensive guard against wrong call ordering (creating an RSS QP before any raw QP), not a synchronization point for concurrent access.
^ permalink raw reply
* RE: [PATCH v2 1/2] Drivers: hv: vmbus: Provide option to skip VMBus unload on panic
From: Michael Kelley @ 2026-05-04 20:47 UTC (permalink / raw)
To: wei.liu@kernel.org, tzimmermann@suse.de, longli@microsoft.com,
jfalempe@redhat.com
Cc: drawat.floss@gmail.com, maarten.lankhorst@linux.intel.com,
mripard@kernel.org, airlied@gmail.com, simona@ffwll.ch,
kys@microsoft.com, haiyangz@microsoft.com, decui@microsoft.com,
ryasuoka@redhat.com, dri-devel@lists.freedesktop.org,
linux-kernel@vger.kernel.org, linux-hyperv@vger.kernel.org,
stable@vger.kernel.org
In-Reply-To: <20260217182335.265585-1-mhklkml@zohomail.com>
From: Michael Kelley <mhklkml@zohomail.com> Sent: Tuesday, February 17, 2026 10:24 AM
>
Wei and Thomas --
This small patch series has been neglected. Patch 2 of the series is here [1].
Long Li < longli@microsoft.com> has given a Reviewed-by on this patch,
and Jocelyn Falempe <jfalempe@redhat.com> has given a Reviewed-by
on Patch 2 of the series, modulo a comment which I have incorporated.
See [2]. But I neglected to add her R-b when I spun v2 of the series.
Any reason this can't be picked up as a bug fix for 7.1? I just checked,
and it applies cleanly to a recent linux-next (20260423). I'd suggest
going through the hyperv tree, as these two patches should be kept
together in sequence.
Michael
[1] https://lore.kernel.org/linux-hyperv/20260217182335.265585-2-mhklkml@zohomail.com/
[2] https://lore.kernel.org/linux-hyperv/a5372b72-8dc0-4f2d-ad5c-086f3e75ee81@redhat.com/
> Currently, VMBus code initiates a VMBus unload in the panic path so
> that if a kdump kernel is loaded, it can start fresh in setting up its
> own VMBus connection. However, a driver for the VMBus virtual frame
> buffer may need to flush dirty portions of the frame buffer back to
> the Hyper-V host so that panic information is visible in the graphics
> console. To support such flushing, provide exported functions for the
> frame buffer driver to specify that the VMBus unload should not be
> done by the VMBus driver, and to initiate the VMBus unload itself.
> Together these allow a frame buffer driver to delay the VMBus unload
> until after it has completed the flush.
>
> Ideally, the VMBus driver could use its own panic-path callback to do
> the unload after all frame buffer drivers have finished. But DRM frame
> buffer drivers use the kmsg dump callback, and there are no callbacks
> after that in the panic path. Hence this somewhat messy approach to
> properly sequencing the frame buffer flush and the VMBus unload.
>
> Fixes: 3671f3777758 ("drm/hyperv: Add support for drm_panic")
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>
> ---
> Changes in v2: None
>
> drivers/hv/channel_mgmt.c | 1 +
> drivers/hv/hyperv_vmbus.h | 1 -
> drivers/hv/vmbus_drv.c | 25 ++++++++++++++++++-------
> include/linux/hyperv.h | 3 +++
> 4 files changed, 22 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/hv/channel_mgmt.c b/drivers/hv/channel_mgmt.c
> index 74fed2c073d4..5de83676dbad 100644
> --- a/drivers/hv/channel_mgmt.c
> +++ b/drivers/hv/channel_mgmt.c
> @@ -944,6 +944,7 @@ void vmbus_initiate_unload(bool crash)
> else
> vmbus_wait_for_unload();
> }
> +EXPORT_SYMBOL_GPL(vmbus_initiate_unload);
>
> static void vmbus_setup_channel_state(struct vmbus_channel *channel,
> struct vmbus_channel_offer_channel *offer)
> diff --git a/drivers/hv/hyperv_vmbus.h b/drivers/hv/hyperv_vmbus.h
> index cdbc5f5c3215..5d3944fc93ae 100644
> --- a/drivers/hv/hyperv_vmbus.h
> +++ b/drivers/hv/hyperv_vmbus.h
> @@ -440,7 +440,6 @@ void hv_vss_deinit(void);
> int hv_vss_pre_suspend(void);
> int hv_vss_pre_resume(void);
> void hv_vss_onchannelcallback(void *context);
> -void vmbus_initiate_unload(bool crash);
>
> static inline void hv_poll_channel(struct vmbus_channel *channel,
> void (*cb)(void *))
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index 6785ad63a9cb..97dfa529d250 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -69,19 +69,29 @@ bool vmbus_is_confidential(void)
> }
> EXPORT_SYMBOL_GPL(vmbus_is_confidential);
>
> +static bool skip_vmbus_unload;
> +
> +/*
> + * Allow a VMBus framebuffer driver to specify that in the case of a panic,
> + * it will do the VMbus unload operation once it has flushed any dirty
> + * portions of the framebuffer to the Hyper-V host.
> + */
> +void vmbus_set_skip_unload(bool skip)
> +{
> + skip_vmbus_unload = skip;
> +}
> +EXPORT_SYMBOL_GPL(vmbus_set_skip_unload);
> +
> /*
> * The panic notifier below is responsible solely for unloading the
> * vmbus connection, which is necessary in a panic event.
> - *
> - * Notice an intrincate relation of this notifier with Hyper-V
> - * framebuffer panic notifier exists - we need vmbus connection alive
> - * there in order to succeed, so we need to order both with each other
> - * [see hvfb_on_panic()] - this is done using notifiers' priorities.
> */
> static int hv_panic_vmbus_unload(struct notifier_block *nb, unsigned long val,
> void *args)
> {
> - vmbus_initiate_unload(true);
> + if (!skip_vmbus_unload)
> + vmbus_initiate_unload(true);
> +
> return NOTIFY_DONE;
> }
> static struct notifier_block hyperv_panic_vmbus_unload_block = {
> @@ -2848,7 +2858,8 @@ static void hv_crash_handler(struct pt_regs *regs)
> {
> int cpu;
>
> - vmbus_initiate_unload(true);
> + if (!skip_vmbus_unload)
> + vmbus_initiate_unload(true);
> /*
> * In crash handler we can't schedule synic cleanup for all CPUs,
> * doing the cleanup for current CPU only. This should be sufficient
> diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
> index dfc516c1c719..b0502a336eb3 100644
> --- a/include/linux/hyperv.h
> +++ b/include/linux/hyperv.h
> @@ -1334,6 +1334,9 @@ int vmbus_allocate_mmio(struct resource **new, struct
> hv_device *device_obj,
> bool fb_overlap_ok);
> void vmbus_free_mmio(resource_size_t start, resource_size_t size);
>
> +void vmbus_initiate_unload(bool crash);
> +void vmbus_set_skip_unload(bool skip);
> +
> /*
> * GUID definitions of various offer types - services offered to the guest.
> */
> --
> 2.25.1
>
^ permalink raw reply
* [PATCH v3 18/18] mshv: Fix missing error code on VP allocation failure
From: Stanislav Kinsburskii @ 2026-05-04 19:10 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
In mshv_partition_ioctl_create_vp(), when kzalloc for the VP struct
fails, the code jumps to the cleanup path without setting ret. At that
point ret is 0 from the preceding successful mshv_vp_stats_map() call,
so the function returns success to userspace despite having failed to
create the VP. No fd is installed and no VP is registered in pt_vp_array,
but userspace has no way to know the operation failed.
Set ret to -ENOMEM before jumping to the cleanup path.
Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_root_main.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 4d58e0b005183..52b760dec3687 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1186,8 +1186,10 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
goto unmap_ghcb_page;
vp = kzalloc_obj(*vp);
- if (!vp)
+ if (!vp) {
+ ret = -ENOMEM;
goto unmap_stats_pages;
+ }
vp->vp_partition = mshv_partition_get(partition);
if (!vp->vp_partition) {
^ permalink raw reply related
* [PATCH v3 17/18] mshv: Validate scheduler message bounds from hypervisor
From: Stanislav Kinsburskii @ 2026-05-04 19:10 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
handle_pair_message() iterates up to msg->vp_count without verifying it
against HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT. Since vp_count is read
from untrusted hypervisor data, a malformed message with a large value
would cause out-of-bounds reads from the partition_ids and vp_indexes
arrays.
handle_bitset_message() iterates over set bits in valid_bank_mask (up to
64) and advances bank_contents for each one. However, the payload buffer
only has space for 16 bank entries. A valid_bank_mask with more than 16
bits set causes bank_contents to read beyond the message buffer.
Fix both by adding bounds validation:
- Clamp vp_count to HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT
- Track banks consumed and stop before exceeding buffer capacity
Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_synic.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)
diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index a916a39888aad..3eb2e535efbcd 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -188,7 +188,9 @@ static void kick_vp(struct mshv_vp *vp)
static void
handle_bitset_message(const struct hv_vp_signal_bitset_scheduler_message *msg)
{
- int bank_idx, vps_signaled = 0, bank_mask_size;
+ int bank_idx, vps_signaled = 0, bank_mask_size, banks_used = 0;
+ const int max_banks = sizeof(msg->vp_bitset.bitset_buffer) /
+ sizeof(u64) - 2; /* subtract format + mask */
struct mshv_partition *partition;
const struct hv_vpset *vpset;
const u64 *bank_contents;
@@ -228,6 +230,11 @@ handle_bitset_message(const struct hv_vp_signal_bitset_scheduler_message *msg)
if (bank_idx == bank_mask_size)
break;
+ if (unlikely(banks_used >= max_banks)) {
+ pr_debug("valid_bank_mask exceeds buffer capacity\n");
+ goto unlock_out;
+ }
+
while (true) {
struct mshv_vp *vp;
@@ -256,6 +263,7 @@ handle_bitset_message(const struct hv_vp_signal_bitset_scheduler_message *msg)
}
bank_contents++;
+ banks_used++;
}
unlock_out:
@@ -272,10 +280,18 @@ handle_pair_message(const struct hv_vp_signal_pair_scheduler_message *msg)
struct mshv_partition *partition = NULL;
struct mshv_vp *vp;
int idx;
+ u8 vp_count = msg->vp_count;
+
+ if (unlikely(vp_count > HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT)) {
+ pr_debug("pair message vp_count %u exceeds max %lu\n",
+ vp_count,
+ (unsigned long)HV_MESSAGE_MAX_PARTITION_VP_PAIR_COUNT);
+ return;
+ }
rcu_read_lock();
- for (idx = 0; idx < msg->vp_count; idx++) {
+ for (idx = 0; idx < vp_count; idx++) {
u64 partition_id = msg->partition_ids[idx];
u32 vp_index = msg->vp_indexes[idx];
^ permalink raw reply related
* [PATCH v3 16/18] mshv: Add store/load ordering for VP array publish
From: Stanislav Kinsburskii @ 2026-05-04 19:10 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
mshv_vp_create() publishes a VP pointer to pt_vp_array with a plain
store. The ISR reads this array locklessly from interrupt context on
other CPUs. Without memory ordering, a reader may observe the non-NULL
pointer before all VP fields (e.g. vp_register_page, run.vp_suspend_queue)
are fully initialized, leading to use of uninitialized data.
Fix by using smp_store_release() when publishing the VP pointer and
smp_load_acquire() on all lockless ISR read sites. This guarantees that
all VP initialization is visible to readers before the pointer itself.
Fixes: 621191d709b1 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_eventfd.c | 2 +-
drivers/hv/mshv_root_main.c | 3 ++-
drivers/hv/mshv_synic.c | 6 +++---
3 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
index c99ca5770d3d0..780f0ea43f2c6 100644
--- a/drivers/hv/mshv_eventfd.c
+++ b/drivers/hv/mshv_eventfd.c
@@ -180,7 +180,7 @@ static int mshv_try_assert_irq_fast(struct mshv_irqfd *irqfd,
if (irq->lapic_apic_id >= MSHV_MAX_VPS)
return -EINVAL;
- vp = partition->pt_vp_array[irq->lapic_apic_id];
+ vp = smp_load_acquire(&partition->pt_vp_array[irq->lapic_apic_id]);
if (!vp)
return -EINVAL;
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 0e9b696853fcb..4d58e0b005183 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -1224,7 +1224,8 @@ mshv_partition_ioctl_create_vp(struct mshv_partition *partition,
/* already exclusive with the partition mutex for all ioctls */
partition->pt_vp_count++;
- partition->pt_vp_array[args.vp_index] = vp;
+ /* Ensure VP is fully initialized before visible to lockless ISR readers */
+ smp_store_release(&partition->pt_vp_array[args.vp_index], vp);
goto out;
diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index 39c6b22e1e11e..a916a39888aad 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -245,7 +245,7 @@ handle_bitset_message(const struct hv_vp_signal_bitset_scheduler_message *msg)
goto unlock_out;
}
- vp = partition->pt_vp_array[vp_index];
+ vp = smp_load_acquire(&partition->pt_vp_array[vp_index]);
if (unlikely(!vp)) {
pr_debug("failed to find VP %u\n", vp_index);
goto unlock_out;
@@ -294,7 +294,7 @@ handle_pair_message(const struct hv_vp_signal_pair_scheduler_message *msg)
break;
}
- vp = partition->pt_vp_array[vp_index];
+ vp = smp_load_acquire(&partition->pt_vp_array[vp_index]);
if (!vp) {
pr_debug("failed to find VP %u\n", vp_index);
break;
@@ -390,7 +390,7 @@ mshv_intercept_isr(struct hv_message *msg)
goto unlock_out;
}
vp_index = array_index_nospec(vp_index, MSHV_MAX_VPS);
- vp = partition->pt_vp_array[vp_index];
+ vp = smp_load_acquire(&partition->pt_vp_array[vp_index]);
if (unlikely(!vp)) {
pr_debug("failed to find VP %u\n", vp_index);
goto unlock_out;
^ permalink raw reply related
* [PATCH v3 15/18] mshv: Add missing vp_index bounds check in intercept ISR
From: Stanislav Kinsburskii @ 2026-05-04 19:10 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
mshv_intercept_isr() reads vp_index from the intercept message payload
and uses it directly to index into partition->pt_vp_array without
validating it against MSHV_MAX_VPS. A malformed or corrupted hypervisor
message with a vp_index beyond the array bounds would cause an
out-of-bounds memory access in interrupt context, likely crashing the
host.
Use READ_ONCE() when reading vp_index from the shared SynIC message
page to prevent the compiler from re-fetching the value between the
bounds check and the array access.
Both handle_bitset_message() and handle_pair_message() already validate
vp_index before use. Add the same check to mshv_intercept_isr() for
consistency.
Fixes: 621191d709b14 ("Drivers: hv: Introduce mshv_root module to expose /dev/mshv to VMMs")
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_synic.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/drivers/hv/mshv_synic.c b/drivers/hv/mshv_synic.c
index 43f1bcbbf2d34..39c6b22e1e11e 100644
--- a/drivers/hv/mshv_synic.c
+++ b/drivers/hv/mshv_synic.c
@@ -8,6 +8,7 @@
*/
#include <linux/kernel.h>
+#include <linux/nospec.h>
#include <linux/slab.h>
#include <linux/mm.h>
#include <linux/interrupt.h>
@@ -382,8 +383,13 @@ mshv_intercept_isr(struct hv_message *msg)
* (because the vp is only deleted when the partition is), no additional
* locking is needed here
*/
- vp_index =
- ((struct hv_opaque_intercept_message *)msg->u.payload)->vp_index;
+ vp_index = READ_ONCE(
+ ((struct hv_opaque_intercept_message *)msg->u.payload)->vp_index);
+ if (unlikely(vp_index >= MSHV_MAX_VPS)) {
+ pr_debug("VP index %u out of bounds\n", vp_index);
+ goto unlock_out;
+ }
+ vp_index = array_index_nospec(vp_index, MSHV_MAX_VPS);
vp = partition->pt_vp_array[vp_index];
if (unlikely(!vp)) {
pr_debug("failed to find VP %u\n", vp_index);
^ permalink raw reply related
* [PATCH v3 14/18] mshv: Use kfree_rcu in mshv_portid_free
From: Stanislav Kinsburskii @ 2026-05-04 19:10 UTC (permalink / raw)
To: kys, haiyangz, wei.liu, decui, longli; +Cc: linux-hyperv, linux-kernel
In-Reply-To: <177792164525.90827.16672331609214066870.stgit@skinsburskii-cloud-desktop.internal.cloudapp.net>
mshv_portid_free() uses synchronize_rcu() followed by kfree() to
reclaim port table entries. This blocks the caller until a full RCU
grace period elapses, which is unnecessary since the same module already
uses the non-blocking kfree_rcu() pattern in mshv_port_table_fini().
Replace with kfree_rcu() to avoid the blocking wait and keep the
reclamation strategy consistent across the file.
Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
---
drivers/hv/mshv_portid_table.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/hv/mshv_portid_table.c b/drivers/hv/mshv_portid_table.c
index d6884c601b298..1ccbafe7aa596 100644
--- a/drivers/hv/mshv_portid_table.c
+++ b/drivers/hv/mshv_portid_table.c
@@ -62,8 +62,7 @@ mshv_portid_free(int port_id)
WARN_ON(!info);
idr_unlock(&port_table_idr);
- synchronize_rcu();
- kfree(info);
+ kfree_rcu(info, portbl_rcu);
}
int
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox