* [PATCH 06/12] platform/wmi: use generic driver_override infrastructure
From: Danilo Krummrich @ 2026-03-24 0:59 UTC
To: Russell King, Greg Kroah-Hartman, Rafael J. Wysocki,
Ioana Ciornei, Nipun Gupta, Nikhil Agarwal, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Bjorn Helgaas,
Armin Wolf, Bjorn Andersson, Mathieu Poirier, Vineeth Vijayan,
Peter Oberparleiter, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
Harald Freudenberger, Holger Dengler, Mark Brown,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Alex Williamson, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Christophe Leroy (CS GROUP)
Cc: linux-kernel, driver-core, linuxppc-dev, linux-hyperv, linux-pci,
platform-driver-x86, linux-arm-msm, linux-remoteproc, linux-s390,
linux-spi, virtualization, kvm, xen-devel, linux-arm-kernel,
Danilo Krummrich, Gui-Dong Han
In-Reply-To: <20260324005919.2408620-1-dakr@kernel.org>
When a driver is probed through __driver_attach(), the bus' match()
callback is called without the device lock held. It therefore accesses
the driver_override field without a lock, which can cause a
use-after-free (UAF).
Fix this by using the driver-core driver_override infrastructure, which
takes care of proper locking internally.
Note that calling match() from __driver_attach() without the device lock
held is intentional. [1]
Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/ [1]
Reported-by: Gui-Dong Han <hanguidong02@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
Fixes: 12046f8c77e0 ("platform/x86: wmi: Add driver_override support")
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
---
drivers/platform/wmi/core.c | 36 +++++-------------------------------
include/linux/wmi.h | 4 ----
2 files changed, 5 insertions(+), 35 deletions(-)
diff --git a/drivers/platform/wmi/core.c b/drivers/platform/wmi/core.c
index b8e6b9a421c6..750e3619724e 100644
--- a/drivers/platform/wmi/core.c
+++ b/drivers/platform/wmi/core.c
@@ -842,39 +842,11 @@ static ssize_t expensive_show(struct device *dev,
}
static DEVICE_ATTR_RO(expensive);
-static ssize_t driver_override_show(struct device *dev, struct device_attribute *attr,
- char *buf)
-{
- struct wmi_device *wdev = to_wmi_device(dev);
- ssize_t ret;
-
- device_lock(dev);
- ret = sysfs_emit(buf, "%s\n", wdev->driver_override);
- device_unlock(dev);
-
- return ret;
-}
-
-static ssize_t driver_override_store(struct device *dev, struct device_attribute *attr,
- const char *buf, size_t count)
-{
- struct wmi_device *wdev = to_wmi_device(dev);
- int ret;
-
- ret = driver_set_override(dev, &wdev->driver_override, buf, count);
- if (ret < 0)
- return ret;
-
- return count;
-}
-static DEVICE_ATTR_RW(driver_override);
-
static struct attribute *wmi_attrs[] = {
&dev_attr_modalias.attr,
&dev_attr_guid.attr,
&dev_attr_instance_count.attr,
&dev_attr_expensive.attr,
- &dev_attr_driver_override.attr,
NULL
};
ATTRIBUTE_GROUPS(wmi);
@@ -943,7 +915,6 @@ static void wmi_dev_release(struct device *dev)
{
struct wmi_block *wblock = dev_to_wblock(dev);
- kfree(wblock->dev.driver_override);
kfree(wblock);
}
@@ -952,10 +923,12 @@ static int wmi_dev_match(struct device *dev, const struct device_driver *driver)
const struct wmi_driver *wmi_driver = to_wmi_driver(driver);
struct wmi_block *wblock = dev_to_wblock(dev);
const struct wmi_device_id *id = wmi_driver->id_table;
+ int ret;
/* When driver_override is set, only bind to the matching driver */
- if (wblock->dev.driver_override)
- return !strcmp(wblock->dev.driver_override, driver->name);
+ ret = device_match_driver_override(dev, driver);
+ if (ret >= 0)
+ return ret;
if (id == NULL)
return 0;
@@ -1076,6 +1049,7 @@ static struct class wmi_bus_class = {
static const struct bus_type wmi_bus_type = {
.name = "wmi",
.dev_groups = wmi_groups,
+ .driver_override = true,
.match = wmi_dev_match,
.uevent = wmi_dev_uevent,
.probe = wmi_dev_probe,
diff --git a/include/linux/wmi.h b/include/linux/wmi.h
index 75cb0c7cfe57..14fb644e1701 100644
--- a/include/linux/wmi.h
+++ b/include/linux/wmi.h
@@ -18,16 +18,12 @@
* struct wmi_device - WMI device structure
* @dev: Device associated with this WMI device
* @setable: True for devices implementing the Set Control Method
- * @driver_override: Driver name to force a match; do not set directly,
- * because core frees it; use driver_set_override() to
- * set or clear it.
*
* This represents WMI devices discovered by the WMI driver core.
*/
struct wmi_device {
struct device dev;
bool setable;
- const char *driver_override;
};
/**
--
2.53.0
* [PATCH 05/12] PCI: use generic driver_override infrastructure
From: Danilo Krummrich @ 2026-03-24 0:59 UTC
In-Reply-To: <20260324005919.2408620-1-dakr@kernel.org>
When a driver is probed through __driver_attach(), the bus' match()
callback is called without the device lock held. It therefore accesses
the driver_override field without a lock, which can cause a
use-after-free (UAF).
Fix this by using the driver-core driver_override infrastructure, which
takes care of proper locking internally.
Note that calling match() from __driver_attach() without the device lock
held is intentional. [1]
Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/ [1]
Reported-by: Gui-Dong Han <hanguidong02@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
Fixes: 782a985d7af2 ("PCI: Introduce new device binding path using pci_dev.driver_override")
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
---
drivers/pci/pci-driver.c | 11 +++++++----
drivers/pci/pci-sysfs.c | 28 ----------------------------
drivers/pci/probe.c | 1 -
drivers/vfio/pci/vfio_pci_core.c | 5 ++---
drivers/xen/xen-pciback/pci_stub.c | 6 ++++--
include/linux/pci.h | 6 ------
6 files changed, 13 insertions(+), 44 deletions(-)
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index dd9075403987..d10ece0889f0 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -138,9 +138,11 @@ static const struct pci_device_id *pci_match_device(struct pci_driver *drv,
{
struct pci_dynid *dynid;
const struct pci_device_id *found_id = NULL, *ids;
+ int ret;
/* When driver_override is set, only bind to the matching driver */
- if (dev->driver_override && strcmp(dev->driver_override, drv->name))
+ ret = device_match_driver_override(&dev->dev, &drv->driver);
+ if (ret == 0)
return NULL;
/* Look at the dynamic ids first, before the static ones */
@@ -164,7 +166,7 @@ static const struct pci_device_id *pci_match_device(struct pci_driver *drv,
* matching.
*/
if (found_id->override_only) {
- if (dev->driver_override)
+ if (ret > 0)
return found_id;
} else {
return found_id;
@@ -172,7 +174,7 @@ static const struct pci_device_id *pci_match_device(struct pci_driver *drv,
}
/* driver_override will always match, send a dummy id */
- if (dev->driver_override)
+ if (ret > 0)
return &pci_device_id_any;
return NULL;
}
@@ -452,7 +454,7 @@ static int __pci_device_probe(struct pci_driver *drv, struct pci_dev *pci_dev)
static inline bool pci_device_can_probe(struct pci_dev *pdev)
{
return (!pdev->is_virtfn || pdev->physfn->sriov->drivers_autoprobe ||
- pdev->driver_override);
+ device_has_driver_override(&pdev->dev));
}
#else
static inline bool pci_device_can_probe(struct pci_dev *pdev)
@@ -1722,6 +1724,7 @@ static const struct cpumask *pci_device_irq_get_affinity(struct device *dev,
const struct bus_type pci_bus_type = {
.name = "pci",
+ .driver_override = true,
.match = pci_bus_match,
.uevent = pci_uevent,
.probe = pci_device_probe,
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 16eaaf749ba9..a9006cf4e9c8 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -615,33 +615,6 @@ static ssize_t devspec_show(struct device *dev,
static DEVICE_ATTR_RO(devspec);
#endif
-static ssize_t driver_override_store(struct device *dev,
- struct device_attribute *attr,
- const char *buf, size_t count)
-{
- struct pci_dev *pdev = to_pci_dev(dev);
- int ret;
-
- ret = driver_set_override(dev, &pdev->driver_override, buf, count);
- if (ret)
- return ret;
-
- return count;
-}
-
-static ssize_t driver_override_show(struct device *dev,
- struct device_attribute *attr, char *buf)
-{
- struct pci_dev *pdev = to_pci_dev(dev);
- ssize_t len;
-
- device_lock(dev);
- len = sysfs_emit(buf, "%s\n", pdev->driver_override);
- device_unlock(dev);
- return len;
-}
-static DEVICE_ATTR_RW(driver_override);
-
static struct attribute *pci_dev_attrs[] = {
&dev_attr_power_state.attr,
&dev_attr_resource.attr,
@@ -669,7 +642,6 @@ static struct attribute *pci_dev_attrs[] = {
#ifdef CONFIG_OF
&dev_attr_devspec.attr,
#endif
- &dev_attr_driver_override.attr,
&dev_attr_ari_enabled.attr,
NULL,
};
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index bccc7a4bdd79..b4707640e102 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -2488,7 +2488,6 @@ static void pci_release_dev(struct device *dev)
pci_release_of_node(pci_dev);
pcibios_release_device(pci_dev);
pci_bus_put(pci_dev->bus);
- kfree(pci_dev->driver_override);
bitmap_free(pci_dev->dma_alias_mask);
dev_dbg(dev, "device released\n");
kfree(pci_dev);
diff --git a/drivers/vfio/pci/vfio_pci_core.c b/drivers/vfio/pci/vfio_pci_core.c
index d43745fe4c84..460852f79f29 100644
--- a/drivers/vfio/pci/vfio_pci_core.c
+++ b/drivers/vfio/pci/vfio_pci_core.c
@@ -1987,9 +1987,8 @@ static int vfio_pci_bus_notifier(struct notifier_block *nb,
pdev->is_virtfn && physfn == vdev->pdev) {
pci_info(vdev->pdev, "Captured SR-IOV VF %s driver_override\n",
pci_name(pdev));
- pdev->driver_override = kasprintf(GFP_KERNEL, "%s",
- vdev->vdev.ops->name);
- WARN_ON(!pdev->driver_override);
+ WARN_ON(device_set_driver_override(&pdev->dev,
+ vdev->vdev.ops->name));
} else if (action == BUS_NOTIFY_BOUND_DRIVER &&
pdev->is_virtfn && physfn == vdev->pdev) {
struct pci_driver *drv = pci_dev_driver(pdev);
diff --git a/drivers/xen/xen-pciback/pci_stub.c b/drivers/xen/xen-pciback/pci_stub.c
index e4b27aecbf05..79a2b5dfd694 100644
--- a/drivers/xen/xen-pciback/pci_stub.c
+++ b/drivers/xen/xen-pciback/pci_stub.c
@@ -598,6 +598,8 @@ static int pcistub_seize(struct pci_dev *dev,
return err;
}
+static struct pci_driver xen_pcibk_pci_driver;
+
/* Called when 'bind'. This means we must _NOT_ call pci_reset_function or
* other functions that take the sysfs lock. */
static int pcistub_probe(struct pci_dev *dev, const struct pci_device_id *id)
@@ -609,8 +611,8 @@ static int pcistub_probe(struct pci_dev *dev, const struct pci_device_id *id)
match = pcistub_match(dev);
- if ((dev->driver_override &&
- !strcmp(dev->driver_override, PCISTUB_DRIVER_NAME)) ||
+ if (device_match_driver_override(&dev->dev,
+ &xen_pcibk_pci_driver.driver) > 0 ||
match) {
if (dev->hdr_type != PCI_HEADER_TYPE_NORMAL
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 1c270f1d5123..57e9463e4347 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -575,12 +575,6 @@ struct pci_dev {
u8 supported_speeds; /* Supported Link Speeds Vector */
phys_addr_t rom; /* Physical address if not from BAR */
size_t romlen; /* Length if not from BAR */
- /*
- * Driver name to force a match. Do not set directly, because core
- * frees it. Use driver_set_override() to set or clear it.
- */
- const char *driver_override;
-
unsigned long priv_flags; /* Private flags for the PCI driver */
/* These methods index pci_reset_fn_methods[] */
--
2.53.0
* [PATCH 04/12] hv: vmbus: use generic driver_override infrastructure
From: Danilo Krummrich @ 2026-03-24 0:59 UTC
In-Reply-To: <20260324005919.2408620-1-dakr@kernel.org>
When a driver is probed through __driver_attach(), the bus' match()
callback is called without the device lock held. It therefore accesses
the driver_override field without a lock, which can cause a
use-after-free (UAF).
Fix this by using the driver-core driver_override infrastructure, which
takes care of proper locking internally.
Note that calling match() from __driver_attach() without the device lock
held is intentional. [1]
Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/ [1]
Reported-by: Gui-Dong Han <hanguidong02@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
Fixes: d765edbb301c ("vmbus: add driver_override support")
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
---
drivers/hv/vmbus_drv.c | 36 +++++-------------------------------
include/linux/hyperv.h | 5 -----
2 files changed, 5 insertions(+), 36 deletions(-)
diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
index bc4fc1951ae1..bc8dfd136f3c 100644
--- a/drivers/hv/vmbus_drv.c
+++ b/drivers/hv/vmbus_drv.c
@@ -541,34 +541,6 @@ static ssize_t device_show(struct device *dev,
}
static DEVICE_ATTR_RO(device);
-static ssize_t driver_override_store(struct device *dev,
- struct device_attribute *attr,
- const char *buf, size_t count)
-{
- struct hv_device *hv_dev = device_to_hv_device(dev);
- int ret;
-
- ret = driver_set_override(dev, &hv_dev->driver_override, buf, count);
- if (ret)
- return ret;
-
- return count;
-}
-
-static ssize_t driver_override_show(struct device *dev,
- struct device_attribute *attr, char *buf)
-{
- struct hv_device *hv_dev = device_to_hv_device(dev);
- ssize_t len;
-
- device_lock(dev);
- len = sysfs_emit(buf, "%s\n", hv_dev->driver_override);
- device_unlock(dev);
-
- return len;
-}
-static DEVICE_ATTR_RW(driver_override);
-
/* Set up per device attributes in /sys/bus/vmbus/devices/<bus device> */
static struct attribute *vmbus_dev_attrs[] = {
&dev_attr_id.attr,
@@ -599,7 +571,6 @@ static struct attribute *vmbus_dev_attrs[] = {
&dev_attr_channel_vp_mapping.attr,
&dev_attr_vendor.attr,
&dev_attr_device.attr,
- &dev_attr_driver_override.attr,
NULL,
};
@@ -711,9 +682,11 @@ static const struct hv_vmbus_device_id *hv_vmbus_get_id(const struct hv_driver *
{
const guid_t *guid = &dev->dev_type;
const struct hv_vmbus_device_id *id;
+ int ret;
/* When driver_override is set, only bind to the matching driver */
- if (dev->driver_override && strcmp(dev->driver_override, drv->name))
+ ret = device_match_driver_override(&dev->device, &drv->driver);
+ if (ret == 0)
return NULL;
/* Look at the dynamic ids first, before the static ones */
@@ -722,7 +695,7 @@ static const struct hv_vmbus_device_id *hv_vmbus_get_id(const struct hv_driver *
id = hv_vmbus_dev_match(drv->id_table, guid);
/* driver_override will always match, send a dummy id */
- if (!id && dev->driver_override)
+ if (!id && ret > 0)
id = &vmbus_device_null;
return id;
@@ -1024,6 +997,7 @@ static const struct dev_pm_ops vmbus_pm = {
/* The one and only one */
static const struct bus_type hv_bus = {
.name = "vmbus",
+ .driver_override = true,
.match = vmbus_match,
.shutdown = vmbus_shutdown,
.remove = vmbus_remove,
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index dfc516c1c719..bf689d07d750 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1272,11 +1272,6 @@ struct hv_device {
u16 device_id;
struct device device;
- /*
- * Driver name to force a match. Do not set directly, because core
- * frees it. Use driver_set_override() to set or clear it.
- */
- const char *driver_override;
struct vmbus_channel *channel;
struct kset *channels_kset;
--
2.53.0
* [PATCH 03/12] cdx: use generic driver_override infrastructure
From: Danilo Krummrich @ 2026-03-24 0:59 UTC
In-Reply-To: <20260324005919.2408620-1-dakr@kernel.org>
When a driver is probed through __driver_attach(), the bus' match()
callback is called without the device lock held. It therefore accesses
the driver_override field without a lock, which can cause a
use-after-free (UAF).
Fix this by using the driver-core driver_override infrastructure, which
takes care of proper locking internally.
Note that calling match() from __driver_attach() without the device lock
held is intentional. [1]
Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/ [1]
Reported-by: Gui-Dong Han <hanguidong02@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
Fixes: 2959ab247061 ("cdx: add the cdx bus driver")
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
---
drivers/cdx/cdx.c | 40 +++++--------------------------------
include/linux/cdx/cdx_bus.h | 4 ----
2 files changed, 5 insertions(+), 39 deletions(-)
diff --git a/drivers/cdx/cdx.c b/drivers/cdx/cdx.c
index 9196dc50a48d..d3d230247262 100644
--- a/drivers/cdx/cdx.c
+++ b/drivers/cdx/cdx.c
@@ -156,8 +156,6 @@ static int cdx_unregister_device(struct device *dev,
} else {
cdx_destroy_res_attr(cdx_dev, MAX_CDX_DEV_RESOURCES);
debugfs_remove_recursive(cdx_dev->debugfs_dir);
- kfree(cdx_dev->driver_override);
- cdx_dev->driver_override = NULL;
}
/*
@@ -268,6 +266,7 @@ static int cdx_bus_match(struct device *dev, const struct device_driver *drv)
const struct cdx_driver *cdx_drv = to_cdx_driver(drv);
const struct cdx_device_id *found_id = NULL;
const struct cdx_device_id *ids;
+ int ret;
if (cdx_dev->is_bus)
return false;
@@ -275,7 +274,8 @@ static int cdx_bus_match(struct device *dev, const struct device_driver *drv)
ids = cdx_drv->match_id_table;
/* When driver_override is set, only bind to the matching driver */
- if (cdx_dev->driver_override && strcmp(cdx_dev->driver_override, drv->name))
+ ret = device_match_driver_override(dev, drv);
+ if (ret == 0)
return false;
found_id = cdx_match_id(ids, cdx_dev);
@@ -289,7 +289,7 @@ static int cdx_bus_match(struct device *dev, const struct device_driver *drv)
*/
if (!found_id->override_only)
return true;
- if (cdx_dev->driver_override)
+ if (ret > 0)
return true;
ids = found_id + 1;
@@ -453,36 +453,6 @@ static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
}
static DEVICE_ATTR_RO(modalias);
-static ssize_t driver_override_store(struct device *dev,
- struct device_attribute *attr,
- const char *buf, size_t count)
-{
- struct cdx_device *cdx_dev = to_cdx_device(dev);
- int ret;
-
- if (WARN_ON(dev->bus != &cdx_bus_type))
- return -EINVAL;
-
- ret = driver_set_override(dev, &cdx_dev->driver_override, buf, count);
- if (ret)
- return ret;
-
- return count;
-}
-
-static ssize_t driver_override_show(struct device *dev,
- struct device_attribute *attr, char *buf)
-{
- struct cdx_device *cdx_dev = to_cdx_device(dev);
- ssize_t len;
-
- device_lock(dev);
- len = sysfs_emit(buf, "%s\n", cdx_dev->driver_override);
- device_unlock(dev);
- return len;
-}
-static DEVICE_ATTR_RW(driver_override);
-
static ssize_t enable_store(struct device *dev, struct device_attribute *attr,
const char *buf, size_t count)
{
@@ -552,7 +522,6 @@ static struct attribute *cdx_dev_attrs[] = {
&dev_attr_class.attr,
&dev_attr_revision.attr,
&dev_attr_modalias.attr,
- &dev_attr_driver_override.attr,
NULL,
};
@@ -646,6 +615,7 @@ ATTRIBUTE_GROUPS(cdx_bus);
const struct bus_type cdx_bus_type = {
.name = "cdx",
+ .driver_override = true,
.match = cdx_bus_match,
.probe = cdx_probe,
.remove = cdx_remove,
diff --git a/include/linux/cdx/cdx_bus.h b/include/linux/cdx/cdx_bus.h
index b1ba97f6c9ad..f54770f110bc 100644
--- a/include/linux/cdx/cdx_bus.h
+++ b/include/linux/cdx/cdx_bus.h
@@ -137,9 +137,6 @@ struct cdx_controller {
* @enabled: is this bus enabled
* @msi_dev_id: MSI Device ID associated with CDX device
* @num_msi: Number of MSI's supported by the device
- * @driver_override: driver name to force a match; do not set directly,
- * because core frees it; use driver_set_override() to
- * set or clear it.
* @irqchip_lock: lock to synchronize irq/msi configuration
* @msi_write_pending: MSI write pending for this device
*/
@@ -165,7 +162,6 @@ struct cdx_device {
bool enabled;
u32 msi_dev_id;
u32 num_msi;
- const char *driver_override;
struct mutex irqchip_lock;
bool msi_write_pending;
};
--
2.53.0
* [PATCH 02/12] bus: fsl-mc: use generic driver_override infrastructure
From: Danilo Krummrich @ 2026-03-24 0:59 UTC
In-Reply-To: <20260324005919.2408620-1-dakr@kernel.org>
When a driver is probed through __driver_attach(), the bus' match()
callback is called without the device lock held. It therefore accesses
the driver_override field without a lock, which can cause a
use-after-free (UAF).
Fix this by using the driver-core driver_override infrastructure, which
takes care of proper locking internally.
Note that calling match() from __driver_attach() without the device lock
held is intentional. [1]
Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/ [1]
Reported-by: Gui-Dong Han <hanguidong02@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
Fixes: 1f86a00c1159 ("bus/fsl-mc: add support for 'driver_override' in the mc-bus")
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
---
drivers/bus/fsl-mc/fsl-mc-bus.c | 43 +++++--------------------------
drivers/vfio/fsl-mc/vfio_fsl_mc.c | 4 +--
include/linux/fsl/mc.h | 4 ---
3 files changed, 8 insertions(+), 43 deletions(-)
diff --git a/drivers/bus/fsl-mc/fsl-mc-bus.c b/drivers/bus/fsl-mc/fsl-mc-bus.c
index c117745cf206..221146e4860b 100644
--- a/drivers/bus/fsl-mc/fsl-mc-bus.c
+++ b/drivers/bus/fsl-mc/fsl-mc-bus.c
@@ -86,12 +86,16 @@ static int fsl_mc_bus_match(struct device *dev, const struct device_driver *drv)
struct fsl_mc_device *mc_dev = to_fsl_mc_device(dev);
const struct fsl_mc_driver *mc_drv = to_fsl_mc_driver(drv);
bool found = false;
+ int ret;
/* When driver_override is set, only bind to the matching driver */
- if (mc_dev->driver_override) {
- found = !strcmp(mc_dev->driver_override, mc_drv->driver.name);
+ ret = device_match_driver_override(dev, drv);
+ if (ret > 0) {
+ found = true;
goto out;
}
+ if (ret == 0)
+ goto out;
if (!mc_drv->match_id_table)
goto out;
@@ -210,39 +214,8 @@ static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
}
static DEVICE_ATTR_RO(modalias);
-static ssize_t driver_override_store(struct device *dev,
- struct device_attribute *attr,
- const char *buf, size_t count)
-{
- struct fsl_mc_device *mc_dev = to_fsl_mc_device(dev);
- int ret;
-
- if (WARN_ON(dev->bus != &fsl_mc_bus_type))
- return -EINVAL;
-
- ret = driver_set_override(dev, &mc_dev->driver_override, buf, count);
- if (ret)
- return ret;
-
- return count;
-}
-
-static ssize_t driver_override_show(struct device *dev,
- struct device_attribute *attr, char *buf)
-{
- struct fsl_mc_device *mc_dev = to_fsl_mc_device(dev);
- ssize_t len;
-
- device_lock(dev);
- len = sysfs_emit(buf, "%s\n", mc_dev->driver_override);
- device_unlock(dev);
- return len;
-}
-static DEVICE_ATTR_RW(driver_override);
-
static struct attribute *fsl_mc_dev_attrs[] = {
&dev_attr_modalias.attr,
- &dev_attr_driver_override.attr,
NULL,
};
@@ -345,6 +318,7 @@ ATTRIBUTE_GROUPS(fsl_mc_bus);
const struct bus_type fsl_mc_bus_type = {
.name = "fsl-mc",
+ .driver_override = true,
.match = fsl_mc_bus_match,
.uevent = fsl_mc_bus_uevent,
.probe = fsl_mc_probe,
@@ -910,9 +884,6 @@ static struct notifier_block fsl_mc_nb;
*/
void fsl_mc_device_remove(struct fsl_mc_device *mc_dev)
{
- kfree(mc_dev->driver_override);
- mc_dev->driver_override = NULL;
-
/*
* The device-specific remove callback will get invoked by device_del()
*/
diff --git a/drivers/vfio/fsl-mc/vfio_fsl_mc.c b/drivers/vfio/fsl-mc/vfio_fsl_mc.c
index 462fae1aa538..b4c3958201b2 100644
--- a/drivers/vfio/fsl-mc/vfio_fsl_mc.c
+++ b/drivers/vfio/fsl-mc/vfio_fsl_mc.c
@@ -424,9 +424,7 @@ static int vfio_fsl_mc_bus_notifier(struct notifier_block *nb,
if (action == BUS_NOTIFY_ADD_DEVICE &&
vdev->mc_dev == mc_cont) {
- mc_dev->driver_override = kasprintf(GFP_KERNEL, "%s",
- vfio_fsl_mc_ops.name);
- if (!mc_dev->driver_override)
+ if (device_set_driver_override(dev, vfio_fsl_mc_ops.name))
dev_warn(dev, "VFIO_FSL_MC: Setting driver override for device in dprc %s failed\n",
dev_name(&mc_cont->dev));
else
diff --git a/include/linux/fsl/mc.h b/include/linux/fsl/mc.h
index 897d6211c163..1da63f2d7040 100644
--- a/include/linux/fsl/mc.h
+++ b/include/linux/fsl/mc.h
@@ -178,9 +178,6 @@ struct fsl_mc_obj_desc {
* @regions: pointer to array of MMIO region entries
* @irqs: pointer to array of pointers to interrupts allocated to this device
* @resource: generic resource associated with this MC object device, if any.
- * @driver_override: driver name to force a match; do not set directly,
- * because core frees it; use driver_set_override() to
- * set or clear it.
*
* Generic device object for MC object devices that are "attached" to a
* MC bus.
@@ -214,7 +211,6 @@ struct fsl_mc_device {
struct fsl_mc_device_irq **irqs;
struct fsl_mc_resource *resource;
struct device_link *consumer_link;
- const char *driver_override;
};
#define to_fsl_mc_device(_dev) \
--
2.53.0
* [PATCH 01/12] amba: use generic driver_override infrastructure
From: Danilo Krummrich @ 2026-03-24 0:59 UTC
In-Reply-To: <20260324005919.2408620-1-dakr@kernel.org>
When a driver is probed through __driver_attach(), the bus' match()
callback is called without the device lock held. It therefore accesses
the driver_override field without a lock, which can cause a
use-after-free (UAF).
Fix this by using the driver-core driver_override infrastructure, which
takes care of proper locking internally.
Note that calling match() from __driver_attach() without the device lock
held is intentional. [1]
Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/ [1]
Reported-by: Gui-Dong Han <hanguidong02@gmail.com>
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
Fixes: 3cf385713460 ("ARM: 8256/1: driver coamba: add device binding path 'driver_override'")
Signed-off-by: Danilo Krummrich <dakr@kernel.org>
---
drivers/amba/bus.c | 37 ++++++-------------------------------
include/linux/amba/bus.h | 5 -----
2 files changed, 6 insertions(+), 36 deletions(-)
diff --git a/drivers/amba/bus.c b/drivers/amba/bus.c
index 6d479caf89cb..d721d64a9858 100644
--- a/drivers/amba/bus.c
+++ b/drivers/amba/bus.c
@@ -82,33 +82,6 @@ static void amba_put_disable_pclk(struct amba_device *pcdev)
}
-static ssize_t driver_override_show(struct device *_dev,
- struct device_attribute *attr, char *buf)
-{
- struct amba_device *dev = to_amba_device(_dev);
- ssize_t len;
-
- device_lock(_dev);
- len = sprintf(buf, "%s\n", dev->driver_override);
- device_unlock(_dev);
- return len;
-}
-
-static ssize_t driver_override_store(struct device *_dev,
- struct device_attribute *attr,
- const char *buf, size_t count)
-{
- struct amba_device *dev = to_amba_device(_dev);
- int ret;
-
- ret = driver_set_override(_dev, &dev->driver_override, buf, count);
- if (ret)
- return ret;
-
- return count;
-}
-static DEVICE_ATTR_RW(driver_override);
-
#define amba_attr_func(name,fmt,arg...) \
static ssize_t name##_show(struct device *_dev, \
struct device_attribute *attr, char *buf) \
@@ -126,7 +99,6 @@ amba_attr_func(resource, "\t%016llx\t%016llx\t%016lx\n",
static struct attribute *amba_dev_attrs[] = {
&dev_attr_id.attr,
&dev_attr_resource.attr,
- &dev_attr_driver_override.attr,
NULL,
};
ATTRIBUTE_GROUPS(amba_dev);
@@ -209,10 +181,11 @@ static int amba_match(struct device *dev, const struct device_driver *drv)
{
struct amba_device *pcdev = to_amba_device(dev);
const struct amba_driver *pcdrv = to_amba_driver(drv);
+ int ret;
mutex_lock(&pcdev->periphid_lock);
if (!pcdev->periphid) {
- int ret = amba_read_periphid(pcdev);
+ ret = amba_read_periphid(pcdev);
/*
* Returning any error other than -EPROBE_DEFER from bus match
@@ -230,8 +203,9 @@ static int amba_match(struct device *dev, const struct device_driver *drv)
mutex_unlock(&pcdev->periphid_lock);
/* When driver_override is set, only bind to the matching driver */
- if (pcdev->driver_override)
- return !strcmp(pcdev->driver_override, drv->name);
+ ret = device_match_driver_override(dev, drv);
+ if (ret >= 0)
+ return ret;
return amba_lookup(pcdrv->id_table, pcdev) != NULL;
}
@@ -436,6 +410,7 @@ static const struct dev_pm_ops amba_pm = {
const struct bus_type amba_bustype = {
.name = "amba",
.dev_groups = amba_dev_groups,
+ .driver_override = true,
.match = amba_match,
.uevent = amba_uevent,
.probe = amba_probe,
diff --git a/include/linux/amba/bus.h b/include/linux/amba/bus.h
index 9946276aff73..6c54d5c0d21f 100644
--- a/include/linux/amba/bus.h
+++ b/include/linux/amba/bus.h
@@ -71,11 +71,6 @@ struct amba_device {
unsigned int cid;
struct amba_cs_uci_id uci;
unsigned int irq[AMBA_NR_IRQS];
- /*
- * Driver name to force a match. Do not set directly, because core
- * frees it. Use driver_set_override() to set or clear it.
- */
- const char *driver_override;
};
struct amba_driver {
--
2.53.0
* [PATCH 00/12] treewide: Convert buses to use generic driver_override
From: Danilo Krummrich @ 2026-03-24 0:59 UTC (permalink / raw)
To: Russell King, Greg Kroah-Hartman, Rafael J. Wysocki,
Ioana Ciornei, Nipun Gupta, Nikhil Agarwal, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Bjorn Helgaas,
Armin Wolf, Bjorn Andersson, Mathieu Poirier, Vineeth Vijayan,
Peter Oberparleiter, Heiko Carstens, Vasily Gorbik,
Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
Harald Freudenberger, Holger Dengler, Mark Brown,
Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
Alex Williamson, Juergen Gross, Stefano Stabellini,
Oleksandr Tyshchenko, Christophe Leroy (CS GROUP)
Cc: linux-kernel, driver-core, linuxppc-dev, linux-hyperv, linux-pci,
platform-driver-x86, linux-arm-msm, linux-remoteproc, linux-s390,
linux-spi, virtualization, kvm, xen-devel, linux-arm-kernel,
Danilo Krummrich
This is the follow-up to the driver_override generalization in [1], converting
the remaining 11 buses and removing the now-unused driver_set_override()
helper.
All of them (except AP, which has a different race condition) are prone to the
potential UAF described in [2], caused by accessing the driver_override field
from their corresponding match() callback.
In order to address this, the generalized driver_override field in struct device
is protected with a spinlock. The driver-core provides accessors, such as
device_match_driver_override(), device_has_driver_override() and
device_set_driver_override(), which all ensure proper locking internally.
Additionally, the driver-core provides a driver_override flag in struct
bus_type, which, once enabled, automatically registers generic sysfs callbacks,
allowing userspace to modify the driver_override field.
SPI and AP are a bit special; both print "\n" when driver_override is not set,
whereas all other buses (and thus the driver-core) produce "(null)\n" in this
case.
Hence, SPI and AP do not take advantage of the driver_override flag in struct
bus_type; AP additionally maintains a counter in its custom sysfs store().
Technically, we could support a custom fallback string when driver_override is
unset in struct bus_type, but only SPI would benefit from this, since AP has
additional custom logic in store() anyways.
(I'm not sure if there are userspace programs that strictly rely on this;
driverctl seems to check for both [3], but I'd rather not break some userspace
tool I'm not aware of. :)
This series is based on v7.0-rc5 with no additional dependencies, hence those
patches can be picked up by subsystems individually.
[1] https://lore.kernel.org/driver-core/20260303115720.48783-1-dakr@kernel.org/
[2] https://bugzilla.kernel.org/show_bug.cgi?id=220789
[3] https://gitlab.com/driverctl/driverctl/-/blob/0.121/driverctl?ref_type=tags#L99
Danilo Krummrich (12):
amba: use generic driver_override infrastructure
bus: fsl-mc: use generic driver_override infrastructure
cdx: use generic driver_override infrastructure
hv: vmbus: use generic driver_override infrastructure
PCI: use generic driver_override infrastructure
platform/wmi: use generic driver_override infrastructure
rpmsg: use generic driver_override infrastructure
vdpa: use generic driver_override infrastructure
s390/cio: use generic driver_override infrastructure
s390/ap: use generic driver_override infrastructure
spi: use generic driver_override infrastructure
driver core: remove driver_set_override()
drivers/amba/bus.c | 37 +++------------
drivers/base/driver.c | 75 ------------------------------
drivers/bus/fsl-mc/fsl-mc-bus.c | 43 +++--------------
drivers/cdx/cdx.c | 40 ++--------------
drivers/hv/vmbus_drv.c | 36 ++------------
drivers/pci/pci-driver.c | 11 +++--
drivers/pci/pci-sysfs.c | 28 -----------
drivers/pci/probe.c | 1 -
drivers/platform/wmi/core.c | 36 ++------------
drivers/rpmsg/qcom_glink_native.c | 2 -
drivers/rpmsg/rpmsg_core.c | 43 +++--------------
drivers/rpmsg/virtio_rpmsg_bus.c | 1 -
drivers/s390/cio/cio.h | 5 --
drivers/s390/cio/css.c | 34 ++------------
drivers/s390/crypto/ap_bus.c | 34 +++++++-------
drivers/s390/crypto/ap_bus.h | 1 -
drivers/s390/crypto/ap_queue.c | 24 +++-------
drivers/spi/spi.c | 19 +++-----
drivers/vdpa/vdpa.c | 48 ++-----------------
drivers/vfio/fsl-mc/vfio_fsl_mc.c | 4 +-
drivers/vfio/pci/vfio_pci_core.c | 5 +-
drivers/xen/xen-pciback/pci_stub.c | 6 ++-
include/linux/amba/bus.h | 5 --
include/linux/cdx/cdx_bus.h | 4 --
include/linux/device/driver.h | 2 -
include/linux/fsl/mc.h | 4 --
include/linux/hyperv.h | 5 --
include/linux/pci.h | 6 ---
include/linux/rpmsg.h | 4 --
include/linux/spi/spi.h | 5 --
include/linux/vdpa.h | 4 --
include/linux/wmi.h | 4 --
32 files changed, 88 insertions(+), 488 deletions(-)
base-commit: c369299895a591d96745d6492d4888259b004a9e
--
2.53.0
* Re: [PATCH net-next v4] net: mana: Expose hardware diagnostic info via debugfs
From: Jakub Kicinski @ 2026-03-24 0:44 UTC (permalink / raw)
To: Erni Sri Satya Vennela
Cc: kys, haiyangz, wei.liu, decui, longli, andrew+netdev, davem,
edumazet, pabeni, kotaranov, horms, shradhagupta, shirazsaleem,
dipayanroy, yury.norov, kees, ssengar, gargaditya, linux-hyperv,
netdev, linux-kernel, linux-rdma
In-Reply-To: <20260319070926.1459515-1-ernis@linux.microsoft.com>
On Thu, 19 Mar 2026 00:09:13 -0700 Erni Sri Satya Vennela wrote:
> Add debugfs entries to expose hardware configuration and diagnostic
> information that aids in debugging driver initialization and runtime
> operations without adding noise to dmesg.
>
> The debugfs directory creation and removal for each PCI device is
> integrated into mana_gd_setup() and mana_gd_cleanup_device()
> respectively, so that all callers (probe, remove, suspend, resume,
> shutdown) share a single code path.
>
> Device-level entries (under /sys/kernel/debug/mana/<slot>/):
> - num_msix_usable, max_num_queues: Max resources from hardware
> - gdma_protocol_ver, pf_cap_flags1: VF version negotiation results
> - num_vports, bm_hostmode: Device configuration
>
> Per-vPort entries (under /sys/kernel/debug/mana/<slot>/vportN/):
> - port_handle: Hardware vPort handle
> - max_sq, max_rq: Max queues from vPort config
> - indir_table_sz: Indirection table size
> - steer_rx, steer_rss, steer_update_tab, steer_cqe_coalescing:
> Last applied steering configuration parameters
AI says:
> @@ -1918,15 +1930,23 @@ static int mana_gd_setup(struct pci_dev *pdev)
> struct gdma_context *gc = pci_get_drvdata(pdev);
> int err;
>
> + if (gc->is_pf)
> + gc->mana_pci_debugfs = debugfs_create_dir("0", mana_debugfs_root);
> + else
> + gc->mana_pci_debugfs = debugfs_create_dir(pci_slot_name(pdev->slot),
> + mana_debugfs_root);
If pdev->slot is NULL (which can happen for VFs in environments like generic
VFIO passthrough or nested KVM), will pci_slot_name(pdev->slot) cause a
NULL pointer dereference?
Also, could this naming scheme cause name collisions? If multiple PFs are
present, they would all try to use "0". Similarly, VFs across different
PCI domains or buses might share the same physical slot identifier, leading
to -EEXIST errors. Would it be safer to use the unique PCI BDF via
pci_name(pdev) instead?
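As a rough userspace illustration of why a full domain:bus:device.function
string avoids such collisions, consider the sketch below. struct model_pci_addr
and model_bdf_name() are hypothetical; the format string is assumed to mirror
the "0000:00:02.0" style names that pci_name() typically returns.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the address parts of a PCI device. */
struct model_pci_addr {
	unsigned int domain;
	unsigned int bus;
	unsigned int dev;
	unsigned int fn;
};

/* Format a BDF-style name such as "0000:00:02.0" into buf. */
static void model_bdf_name(const struct model_pci_addr *a, char *buf, size_t len)
{
	assert(a && buf && len);
	snprintf(buf, len, "%04x:%02x:%02x.%u", a->domain, a->bus, a->dev, a->fn);
}
```

Two devices that share the same device number but sit in different domains or
on different buses still get distinct directory names with this scheme.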
> @@ -3141,6 +3149,24 @@ static int mana_init_port(struct net_device *ndev)
> eth_hw_addr_set(ndev, apc->mac_addr);
> sprintf(vport, "vport%d", port_idx);
> apc->mana_port_debugfs = debugfs_create_dir(vport, gc->mana_pci_debugfs);
> +
> + debugfs_create_u64("port_handle", 0400, apc->mana_port_debugfs,
> + &apc->port_handle);
When operations like changing the MTU or setting an XDP program trigger a
detach/attach cycle, mana_detach() invokes mana_cleanup_port_context(),
which recursively removes the apc->mana_port_debugfs directory.
During re-attachment, mana_attach() calls mana_init_port(), which
recreates the directory and the new files added in this patch. However, the
pre-existing current_speed file (created in mana_probe_port()) is not
recreated here.
Does this cause the current_speed file to be permanently lost after a
detach/attach cycle? Should the creation of current_speed be moved to
mana_init_port() so it survives the cycle?
--
pw-bot: cr
* Re: [PATCH ethtool-next] netlink: settings: add netlink support for RX CQE Coalescing params
From: patchwork-bot+netdevbpf @ 2026-03-23 22:20 UTC (permalink / raw)
To: Haiyang Zhang; +Cc: mkubecek, linux-hyperv, netdev, haiyangz, paulros
In-Reply-To: <20260320203159.1590235-1-haiyangz@linux.microsoft.com>
Hello:
This patch was applied to ethtool/ethtool.git (next)
by Michal Kubecek <mkubecek@suse.cz>:
On Fri, 20 Mar 2026 13:31:59 -0700 you wrote:
> From: Haiyang Zhang <haiyangz@microsoft.com>
>
> Add support to get/set RX CQE Coalescing parameters, including the max frames
> and time out value in nanoseconds.
>
> (Headers: dc3d720e12f6 "net: ethtool: add ethtool COALESCE_RX_CQE_FRAMES/NSECS")
>
> [...]
Here is the summary with links:
- [ethtool-next] netlink: settings: add netlink support for RX CQE Coalescing params
https://git.kernel.org/pub/scm/network/ethtool/ethtool.git/commit/?id=d35d87fbcda9
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
* [PATCH rdma v2] RDMA/mana_ib: Disable RX steering on RSS QP destroy
From: Long Li @ 2026-03-23 20:10 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel,
stable
When an RSS QP is destroyed (e.g. DPDK exit), mana_ib_destroy_qp_rss()
destroys the RX WQ objects but does not disable vPort RX steering in
firmware. This leaves stale steering configuration that still points to
the destroyed RX objects.
If traffic continues to arrive (e.g. peer VM is still transmitting) and
the VF interface is subsequently brought up (mana_open), the firmware
may deliver completions using stale CQ IDs from the old RX objects.
These CQ IDs can be reused by the ethernet driver for new TX CQs,
causing RX completions to land on TX CQs:
WARNING: mana_poll_tx_cq+0x1b8/0x220 [mana] (is_sq == false)
WARNING: mana_gd_process_eq_events+0x209/0x290 (cq_table lookup fails)
Fix this by disabling vPort RX steering before destroying RX WQ objects.
Note that mana_fence_rqs() cannot be used here because the fence
completion is delivered on the CQ, which is polled by user-mode (e.g.
DPDK) and not visible to the kernel driver.
Refactor the disable logic into a shared mana_disable_vport_rx() in
mana_en, exported for use by mana_ib, replacing the duplicate code.
The ethernet driver's mana_dealloc_queues() is also updated to call
this common function.
Fixes: 0266a177631d ("RDMA/mana_ib: Add a driver for Microsoft Azure Network Adapter")
Cc: stable@vger.kernel.org
Signed-off-by: Long Li <longli@microsoft.com>
---
v2:
- Removed redundant ibdev_err on mana_disable_vport_rx() failure as
mana_cfg_vport_steering() already logs all failure scenarios.
- Added comment clarifying this is best effort.
drivers/infiniband/hw/mana/qp.c | 15 +++++++++++++++
drivers/net/ethernet/microsoft/mana/mana_en.c | 11 ++++++++++-
include/net/mana/mana.h | 1 +
3 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
index 80cf4ade4b75..685e61e8436c 100644
--- a/drivers/infiniband/hw/mana/qp.c
+++ b/drivers/infiniband/hw/mana/qp.c
@@ -834,6 +834,21 @@ static int mana_ib_destroy_qp_rss(struct mana_ib_qp *qp,
ndev = mana_ib_get_netdev(qp->ibqp.device, qp->port);
mpc = netdev_priv(ndev);
+ /* Disable vPort RX steering before destroying RX WQ objects.
+ * Otherwise firmware still routes traffic to the destroyed queues,
+ * which can cause bogus completions on reused CQ IDs when the
+ * ethernet driver later creates new queues on mana_open().
+ *
+ * Unlike the ethernet teardown path, mana_fence_rqs() cannot be
+ * used here because the fence completion CQE is delivered on the
+ * CQ which is polled by userspace (e.g. DPDK), so there is no way
+ * for the kernel to wait for fence completion.
+ *
+ * This is best effort — if it fails there is not much we can do,
+ * and mana_cfg_vport_steering() already logs the error.
+ */
+ mana_disable_vport_rx(mpc);
+
for (i = 0; i < (1 << ind_tbl->log_ind_tbl_size); i++) {
ibwq = ind_tbl->ind_tbl[i];
wq = container_of(ibwq, struct mana_ib_wq, ibwq);
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index b3c3a70f733f..0816279f525e 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -2934,6 +2934,13 @@ static void mana_rss_table_init(struct mana_port_context *apc)
ethtool_rxfh_indir_default(i, apc->num_queues);
}
+int mana_disable_vport_rx(struct mana_port_context *apc)
+{
+ return mana_cfg_vport_steering(apc, TRI_STATE_FALSE, false, false,
+ false);
+}
+EXPORT_SYMBOL_NS(mana_disable_vport_rx, "NET_MANA");
+
int mana_config_rss(struct mana_port_context *apc, enum TRI_STATE rx,
bool update_hash, bool update_tab)
{
@@ -3339,10 +3346,12 @@ static int mana_dealloc_queues(struct net_device *ndev)
*/
apc->rss_state = TRI_STATE_FALSE;
- err = mana_config_rss(apc, TRI_STATE_FALSE, false, false);
+ err = mana_disable_vport_rx(apc);
if (err && mana_en_need_log(apc, err))
netdev_err(ndev, "Failed to disable vPort: %d\n", err);
+ mana_fence_rqs(apc);
+
/* Even in err case, still need to cleanup the vPort */
mana_destroy_rxqs(apc);
mana_destroy_txq(apc);
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 204c2b612a62..2634e9135eed 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -574,6 +574,7 @@ struct mana_port_context {
netdev_tx_t mana_start_xmit(struct sk_buff *skb, struct net_device *ndev);
int mana_config_rss(struct mana_port_context *ac, enum TRI_STATE rx,
bool update_hash, bool update_tab);
+int mana_disable_vport_rx(struct mana_port_context *apc);
int mana_alloc_queues(struct net_device *ndev);
int mana_attach(struct net_device *ndev);
--
2.43.0
* RE: [EXTERNAL] Re: [PATCH net-next v4 0/6] net: mana: Per-vPort EQ and MSI-X interrupt management
From: Long Li @ 2026-03-23 20:01 UTC (permalink / raw)
To: Simon Horman
Cc: Konstantin Taranov, Jakub Kicinski, David S . Miller, Paolo Abeni,
Eric Dumazet, Andrew Lunn, Jason Gunthorpe, Leon Romanovsky,
Haiyang Zhang, KY Srinivasan, Wei Liu, Dexuan Cui,
netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260323134337.GA69756@horms.kernel.org>
> -----Original Message-----
> From: Simon Horman <horms@kernel.org>
> Sent: Monday, March 23, 2026 6:44 AM
> To: Long Li <longli@microsoft.com>
> Cc: Konstantin Taranov <kotaranov@microsoft.com>; Jakub Kicinski
> <kuba@kernel.org>; David S . Miller <davem@davemloft.net>; Paolo Abeni
> <pabeni@redhat.com>; Eric Dumazet <edumazet@google.com>; Andrew Lunn
> <andrew+netdev@lunn.ch>; Jason Gunthorpe <jgg@ziepe.ca>; Leon Romanovsky
> <leon@kernel.org>; Haiyang Zhang <haiyangz@microsoft.com>; KY Srinivasan
> <kys@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> <DECUI@microsoft.com>; netdev@vger.kernel.org; linux-rdma@vger.kernel.org;
> linux-hyperv@vger.kernel.org; linux-kernel@vger.kernel.org
> Subject: [EXTERNAL] Re: [PATCH net-next v4 0/6] net: mana: Per-vPort EQ and
> MSI-X interrupt management
>
> On Fri, Mar 20, 2026 at 04:54:13PM -0700, Long Li wrote:
> > This series adds per-vPort Event Queue (EQ) allocation and MSI-X
> > interrupt management for the MANA driver. Previously, all vPorts
> > shared a single set of EQs. This change enables dedicated EQs per
> > vPort with support for both dedicated and shared MSI-X vector allocation
> modes.
>
> ...
>
> Hi Long Li,
>
> Unfortunately this series did not apply to net-next cleanly.
> Which breaks our CI.
>
> Please rebase and repost.
>
> Thanks!
I have sent v5 of the patch set.
Please apply the patch set on top of the patch "net: mana: Set default number of queues to 16".
Thank you,
Long
* [PATCH net-next v5 6/6] RDMA/mana_ib: Allocate interrupt contexts on EQs
From: Long Li @ 2026-03-23 19:59 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260323195952.1767304-1-longli@microsoft.com>
Use the GIC functions to allocate interrupt contexts for RDMA EQs. These
interrupt contexts may be shared with Ethernet EQs when MSI-X vectors
are limited.
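For illustration, the index computation used for the RDMA completion EQs in the
diff below, (i + 1) % gc->num_msix_usable, can be modeled in userspace as
follows. model_eq_msix_index() is a hypothetical helper, not a driver function;
it only shows how the indices wrap, and therefore share vectors, once there are
more EQs than usable MSI-X vectors.

```c
#include <assert.h>

/* MSI-X index for the i-th completion EQ; wraps around when vectors run out. */
static unsigned int model_eq_msix_index(unsigned int i,
					unsigned int num_msix_usable)
{
	assert(num_msix_usable > 0);
	return (i + 1) % num_msix_usable;
}
```

With four usable vectors, EQs 0, 1 and 2 map to indices 1, 2 and 3, EQ 3 wraps
to index 0 (the one the fatal error EQ also uses), and EQ 4 shares index 1 with
EQ 0.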
The driver now supports allocating dedicated MSI-X for each EQ. Indicate
this capability through driver capability bits.
Signed-off-by: Long Li <longli@microsoft.com>
---
drivers/infiniband/hw/mana/main.c | 33 ++++++++++++++++++++++++++-----
include/net/mana/gdma.h | 7 +++++--
2 files changed, 33 insertions(+), 7 deletions(-)
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index d51dd0ee85f4..0b74dd093b41 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -787,6 +787,7 @@ int mana_ib_create_eqs(struct mana_ib_dev *mdev)
{
struct gdma_context *gc = mdev_to_gc(mdev);
struct gdma_queue_spec spec = {};
+ struct gdma_irq_context *gic;
int err, i;
spec.type = GDMA_EQ;
@@ -797,9 +798,15 @@ int mana_ib_create_eqs(struct mana_ib_dev *mdev)
spec.eq.log2_throttle_limit = LOG2_EQ_THROTTLE;
spec.eq.msix_index = 0;
+ gic = mana_gd_get_gic(gc, false, &spec.eq.msix_index);
+ if (!gic)
+ return -ENOMEM;
+
err = mana_gd_create_mana_eq(mdev->gdma_dev, &spec, &mdev->fatal_err_eq);
- if (err)
+ if (err) {
+ mana_gd_put_gic(gc, false, 0);
return err;
+ }
mdev->eqs = kzalloc_objs(struct gdma_queue *,
mdev->ib_dev.num_comp_vectors);
@@ -810,31 +817,47 @@ int mana_ib_create_eqs(struct mana_ib_dev *mdev)
spec.eq.callback = NULL;
for (i = 0; i < mdev->ib_dev.num_comp_vectors; i++) {
spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
+
+ gic = mana_gd_get_gic(gc, false, &spec.eq.msix_index);
+ if (!gic) {
+ err = -ENOMEM;
+ goto destroy_eqs;
+ }
+
err = mana_gd_create_mana_eq(mdev->gdma_dev, &spec, &mdev->eqs[i]);
- if (err)
+ if (err) {
+ mana_gd_put_gic(gc, false, spec.eq.msix_index);
goto destroy_eqs;
+ }
}
return 0;
destroy_eqs:
- while (i-- > 0)
+ while (i-- > 0) {
mana_gd_destroy_queue(gc, mdev->eqs[i]);
+ mana_gd_put_gic(gc, false, (i + 1) % gc->num_msix_usable);
+ }
kfree(mdev->eqs);
destroy_fatal_eq:
mana_gd_destroy_queue(gc, mdev->fatal_err_eq);
+ mana_gd_put_gic(gc, false, 0);
return err;
}
void mana_ib_destroy_eqs(struct mana_ib_dev *mdev)
{
struct gdma_context *gc = mdev_to_gc(mdev);
- int i;
+ int i, msi;
mana_gd_destroy_queue(gc, mdev->fatal_err_eq);
+ mana_gd_put_gic(gc, false, 0);
- for (i = 0; i < mdev->ib_dev.num_comp_vectors; i++)
+ for (i = 0; i < mdev->ib_dev.num_comp_vectors; i++) {
mana_gd_destroy_queue(gc, mdev->eqs[i]);
+ msi = (i + 1) % gc->num_msix_usable;
+ mana_gd_put_gic(gc, false, msi);
+ }
kfree(mdev->eqs);
}
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 84f85b2299b4..9faa072e779e 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -615,6 +615,7 @@ enum {
#define GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECONFIG BIT(3)
#define GDMA_DRV_CAP_FLAG_1_GDMA_PAGES_4MB_1GB_2GB BIT(4)
#define GDMA_DRV_CAP_FLAG_1_VARIABLE_INDIRECTION_TABLE_SUPPORT BIT(5)
+#define GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE BIT(6)
/* Driver can handle holes (zeros) in the device list */
#define GDMA_DRV_CAP_FLAG_1_DEV_LIST_HOLES_SUP BIT(11)
@@ -631,7 +632,8 @@ enum {
/* Driver detects stalled send queues and recovers them */
#define GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY BIT(18)
-#define GDMA_DRV_CAP_FLAG_1_HW_VPORT_LINK_AWARE BIT(6)
+/* Driver supports separate EQ/MSIs for each vPort */
+#define GDMA_DRV_CAP_FLAG_1_EQ_MSI_UNSHARE_MULTI_VPORT BIT(19)
/* Driver supports linearizing the skb when num_sge exceeds hardware limit */
#define GDMA_DRV_CAP_FLAG_1_SKB_LINEARIZE BIT(20)
@@ -659,7 +661,8 @@ enum {
GDMA_DRV_CAP_FLAG_1_SKB_LINEARIZE | \
GDMA_DRV_CAP_FLAG_1_PROBE_RECOVERY | \
GDMA_DRV_CAP_FLAG_1_HANDLE_STALL_SQ_RECOVERY | \
- GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY)
+ GDMA_DRV_CAP_FLAG_1_HWC_TIMEOUT_RECOVERY | \
+ GDMA_DRV_CAP_FLAG_1_EQ_MSI_UNSHARE_MULTI_VPORT)
#define GDMA_DRV_CAP_FLAGS2 0
--
2.43.0
* [PATCH net-next v5 5/6] net: mana: Allocate interrupt context for each EQ when creating vPort
From: Long Li @ 2026-03-23 19:59 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260323195952.1767304-1-longli@microsoft.com>
Use GIC functions to create a dedicated interrupt context or acquire a
shared interrupt context for each EQ when setting up a vPort.
Signed-off-by: Long Li <longli@microsoft.com>
---
drivers/net/ethernet/microsoft/mana/gdma_main.c | 2 +-
drivers/net/ethernet/microsoft/mana/mana_en.c | 17 ++++++++++++++++-
include/net/mana/gdma.h | 1 +
3 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index e7d5e589a217..34b19e0740e1 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -826,7 +826,6 @@ static void mana_gd_deregister_irq(struct gdma_queue *queue)
}
spin_unlock_irqrestore(&gic->lock, flags);
- queue->eq.msix_index = INVALID_PCI_MSIX_INDEX;
synchronize_rcu();
}
@@ -941,6 +940,7 @@ static int mana_gd_create_eq(struct gdma_dev *gd,
out:
dev_err(dev, "Failed to create EQ: %d\n", err);
mana_gd_destroy_eq(gc, false, queue);
+ queue->eq.msix_index = INVALID_PCI_MSIX_INDEX;
return err;
}
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 004d48bba8aa..b3c3a70f733f 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1606,6 +1606,7 @@ void mana_destroy_eq(struct mana_port_context *apc)
struct gdma_context *gc = ac->gdma_dev->gdma_context;
struct gdma_queue *eq;
int i;
+ unsigned int msi;
if (!apc->eqs)
return;
@@ -1618,7 +1619,9 @@ void mana_destroy_eq(struct mana_port_context *apc)
if (!eq)
continue;
+ msi = eq->eq.msix_index;
mana_gd_destroy_queue(gc, eq);
+ mana_gd_put_gic(gc, !gc->msi_sharing, msi);
}
kfree(apc->eqs);
@@ -1635,6 +1638,7 @@ static void mana_create_eq_debugfs(struct mana_port_context *apc, int i)
eq.mana_eq_debugfs = debugfs_create_dir(eqnum, apc->mana_eqs_debugfs);
debugfs_create_u32("head", 0400, eq.mana_eq_debugfs, &eq.eq->head);
debugfs_create_u32("tail", 0400, eq.mana_eq_debugfs, &eq.eq->tail);
+ debugfs_create_u32("irq", 0400, eq.mana_eq_debugfs, &eq.eq->eq.irq);
debugfs_create_file("eq_dump", 0400, eq.mana_eq_debugfs, eq.eq, &mana_dbg_q_fops);
}
@@ -1645,6 +1649,7 @@ int mana_create_eq(struct mana_port_context *apc)
struct gdma_queue_spec spec = {};
int err;
int i;
+ struct gdma_irq_context *gic;
WARN_ON(apc->eqs);
apc->eqs = kzalloc_objs(struct mana_eq, apc->num_queues);
@@ -1661,12 +1666,22 @@ int mana_create_eq(struct mana_port_context *apc)
apc->mana_eqs_debugfs = debugfs_create_dir("EQs", apc->mana_port_debugfs);
for (i = 0; i < apc->num_queues; i++) {
- spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
+ if (gc->msi_sharing)
+ spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
+
+ gic = mana_gd_get_gic(gc, !gc->msi_sharing, &spec.eq.msix_index);
+ if (!gic) {
+ err = -ENOMEM;
+ goto out;
+ }
+
err = mana_gd_create_mana_eq(gd, &spec, &apc->eqs[i].eq);
if (err) {
dev_err(gc->dev, "Failed to create EQ %d : %d\n", i, err);
+ mana_gd_put_gic(gc, !gc->msi_sharing, spec.eq.msix_index);
goto out;
}
+ apc->eqs[i].eq->eq.irq = gic->irq;
mana_create_eq_debugfs(apc, i);
}
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 4614a6a7271b..84f85b2299b4 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -342,6 +342,7 @@ struct gdma_queue {
void *context;
unsigned int msix_index;
+ unsigned int irq;
u32 log2_throttle_limit;
} eq;
--
2.43.0
* [PATCH net-next v5 4/6] net: mana: Use GIC functions to allocate global EQs
From: Long Li @ 2026-03-23 19:59 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260323195952.1767304-1-longli@microsoft.com>
Replace the GDMA global interrupt setup code with the new GIC allocation
and release functions for managing interrupt contexts.
Signed-off-by: Long Li <longli@microsoft.com>
---
.../net/ethernet/microsoft/mana/gdma_main.c | 80 +++----------------
1 file changed, 10 insertions(+), 70 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 69a4427919f5..e7d5e589a217 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1860,30 +1860,13 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
* further used in irq_setup()
*/
for (i = 1; i <= nvec; i++) {
- gic = kzalloc_obj(*gic);
+ gic = mana_gd_get_gic(gc, false, &i);
if (!gic) {
err = -ENOMEM;
goto free_irq;
}
- gic->handler = mana_gd_process_eq_events;
- INIT_LIST_HEAD(&gic->eq_list);
- spin_lock_init(&gic->lock);
-
- snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_q%d@pci:%s",
- i - 1, pci_name(pdev));
-
- /* one pci vector is already allocated for HWC */
- irqs[i - 1] = pci_irq_vector(pdev, i);
- if (irqs[i - 1] < 0) {
- err = irqs[i - 1];
- goto free_current_gic;
- }
-
- err = request_irq(irqs[i - 1], mana_gd_intr, 0, gic->name, gic);
- if (err)
- goto free_current_gic;
- xa_store(&gc->irq_contexts, i, gic, GFP_KERNEL);
+ irqs[i - 1] = gic->irq;
}
/*
@@ -1905,19 +1888,11 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
kfree(irqs);
return 0;
-free_current_gic:
- kfree(gic);
free_irq:
for (i -= 1; i > 0; i--) {
irq = pci_irq_vector(pdev, i);
- gic = xa_load(&gc->irq_contexts, i);
- if (WARN_ON(!gic))
- continue;
-
irq_update_affinity_hint(irq, NULL);
- free_irq(irq, gic);
- xa_erase(&gc->irq_contexts, i);
- kfree(gic);
+ mana_gd_put_gic(gc, false, i);
}
kfree(irqs);
return err;
@@ -1938,34 +1913,13 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
start_irqs = irqs;
for (i = 0; i < nvec; i++) {
- gic = kzalloc_obj(*gic);
+ gic = mana_gd_get_gic(gc, false, &i);
if (!gic) {
err = -ENOMEM;
goto free_irq;
}
- gic->handler = mana_gd_process_eq_events;
- INIT_LIST_HEAD(&gic->eq_list);
- spin_lock_init(&gic->lock);
-
- if (!i)
- snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_hwc@pci:%s",
- pci_name(pdev));
- else
- snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_q%d@pci:%s",
- i - 1, pci_name(pdev));
-
- irqs[i] = pci_irq_vector(pdev, i);
- if (irqs[i] < 0) {
- err = irqs[i];
- goto free_current_gic;
- }
-
- err = request_irq(irqs[i], mana_gd_intr, 0, gic->name, gic);
- if (err)
- goto free_current_gic;
-
- xa_store(&gc->irq_contexts, i, gic, GFP_KERNEL);
+ irqs[i] = gic->irq;
}
/* If number of IRQ is one extra than number of online CPUs,
@@ -1994,19 +1948,11 @@ static int mana_gd_setup_irqs(struct pci_dev *pdev, int nvec)
kfree(start_irqs);
return 0;
-free_current_gic:
- kfree(gic);
free_irq:
for (i -= 1; i >= 0; i--) {
irq = pci_irq_vector(pdev, i);
- gic = xa_load(&gc->irq_contexts, i);
- if (WARN_ON(!gic))
- continue;
-
irq_update_affinity_hint(irq, NULL);
- free_irq(irq, gic);
- xa_erase(&gc->irq_contexts, i);
- kfree(gic);
+ mana_gd_put_gic(gc, false, i);
}
kfree(start_irqs);
@@ -2081,26 +2027,20 @@ static int mana_gd_setup_remaining_irqs(struct pci_dev *pdev)
static void mana_gd_remove_irqs(struct pci_dev *pdev)
{
struct gdma_context *gc = pci_get_drvdata(pdev);
- struct gdma_irq_context *gic;
int irq, i;
if (gc->max_num_msix < 1)
return;
for (i = 0; i < gc->max_num_msix; i++) {
- irq = pci_irq_vector(pdev, i);
- if (irq < 0)
- continue;
-
- gic = xa_load(&gc->irq_contexts, i);
- if (WARN_ON(!gic))
+ if (!xa_load(&gc->irq_contexts, i))
continue;
/* Need to clear the hint before free_irq */
+ irq = pci_irq_vector(pdev, i);
irq_update_affinity_hint(irq, NULL);
- free_irq(irq, gic);
- xa_erase(&gc->irq_contexts, i);
- kfree(gic);
+
+ mana_gd_put_gic(gc, false, i);
}
pci_free_irq_vectors(pdev);
--
2.43.0
* [PATCH net-next v5 3/6] net: mana: Introduce GIC context with refcounting for interrupt management
From: Long Li @ 2026-03-23 19:59 UTC (permalink / raw)
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260323195952.1767304-1-longli@microsoft.com>
To allow Ethernet EQs to use dedicated or shared MSI-X vectors and RDMA
EQs to share the same MSI-X, introduce a GIC (GDMA IRQ Context) with
reference counting. This allows the driver to create an interrupt context
on an assigned or unassigned MSI-X vector and share it across multiple
EQ consumers.
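The get/put sharing idea can be sketched in userspace as below. This is a
simplified model under stated assumptions, not the code in this patch: struct
model_gic, struct model_ctx and the model_*() helpers are hypothetical, a fixed
array replaces the xarray, and the mutex, the MSI bitmap and the actual IRQ
request are left out.

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_NR_MSI 8	/* arbitrary table size for the model */

/* Hypothetical interrupt context shared by every EQ on one MSI index. */
struct model_gic {
	int msi;
	unsigned int refcount;
};

struct model_ctx {
	struct model_gic *slots[MODEL_NR_MSI];	/* NULL means not allocated */
};

/* Take a reference on the context for msi, creating it on first use. */
static struct model_gic *model_get_gic(struct model_ctx *c, int msi)
{
	struct model_gic *g;

	assert(msi >= 0 && msi < MODEL_NR_MSI);
	g = c->slots[msi];
	if (g) {
		g->refcount++;	/* another consumer shares the vector */
		return g;
	}

	g = calloc(1, sizeof(*g));
	if (!g)
		return NULL;
	g->msi = msi;
	g->refcount = 1;
	c->slots[msi] = g;
	return g;
}

/* Drop a reference; the context is freed when the last user is gone. */
static void model_put_gic(struct model_ctx *c, int msi)
{
	struct model_gic *g;

	assert(msi >= 0 && msi < MODEL_NR_MSI);
	g = c->slots[msi];
	if (!g)
		return;
	if (--g->refcount)
		return;
	c->slots[msi] = NULL;
	free(g);
}
```

In this model an Ethernet EQ and an RDMA EQ asking for the same index receive
the same context, and the vector is only torn down after both have called the
put helper.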
Signed-off-by: Long Li <longli@microsoft.com>
---
Changes in v4:
- Track dyn_msix in GIC context instead of re-checking
pci_msix_can_alloc_dyn() on each call
- Improved remove_irqs iteration to skip unallocated entries
Changes in v2:
- Fixed spelling typo in gdma_main.c ("difference" -> "different")
---
.../net/ethernet/microsoft/mana/gdma_main.c | 159 ++++++++++++++++++
include/net/mana/gdma.h | 11 ++
2 files changed, 170 insertions(+)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index ae18b4054a02..69a4427919f5 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -1587,6 +1587,164 @@ static irqreturn_t mana_gd_intr(int irq, void *arg)
return IRQ_HANDLED;
}
+void mana_gd_put_gic(struct gdma_context *gc, bool use_msi_bitmap, int msi)
+{
+ struct pci_dev *dev = to_pci_dev(gc->dev);
+ struct msi_map irq_map;
+ struct gdma_irq_context *gic;
+ int irq;
+
+ mutex_lock(&gc->gic_mutex);
+
+ gic = xa_load(&gc->irq_contexts, msi);
+ if (WARN_ON(!gic)) {
+ mutex_unlock(&gc->gic_mutex);
+ return;
+ }
+
+ if (use_msi_bitmap)
+ gic->bitmap_refs--;
+
+ if (use_msi_bitmap && gic->bitmap_refs == 0)
+ clear_bit(msi, gc->msi_bitmap);
+
+ if (!refcount_dec_and_test(&gic->refcount))
+ goto out;
+
+ irq = pci_irq_vector(dev, msi);
+
+ irq_update_affinity_hint(irq, NULL);
+ free_irq(irq, gic);
+
+ if (gic->dyn_msix) {
+ irq_map.virq = irq;
+ irq_map.index = msi;
+ pci_msix_free_irq(dev, irq_map);
+ }
+
+ xa_erase(&gc->irq_contexts, msi);
+ kfree(gic);
+
+out:
+ mutex_unlock(&gc->gic_mutex);
+}
+EXPORT_SYMBOL_NS(mana_gd_put_gic, "NET_MANA");
+
+/*
+ * Get a GIC (GDMA IRQ Context) on an MSI vector.
+ * An MSI can be shared between different EQs. This function supports setting
+ * up separate MSIs using a bitmap, or directly using the MSI index.
+ *
+ * @use_msi_bitmap:
+ * True if the MSI is picked by this function from a free slot in the bitmap.
+ * False if the MSI is passed in through *msi_requested.
+ */
+struct gdma_irq_context *mana_gd_get_gic(struct gdma_context *gc,
+ bool use_msi_bitmap,
+ int *msi_requested)
+{
+ struct gdma_irq_context *gic;
+ struct pci_dev *dev = to_pci_dev(gc->dev);
+ struct msi_map irq_map = { };
+ int irq;
+ int msi;
+ int err;
+
+ mutex_lock(&gc->gic_mutex);
+
+ if (use_msi_bitmap) {
+ msi = find_first_zero_bit(gc->msi_bitmap, gc->num_msix_usable);
+ if (msi >= gc->num_msix_usable) {
+ dev_err(gc->dev, "No free MSI vectors available\n");
+ gic = NULL;
+ goto out;
+ }
+ *msi_requested = msi;
+ } else {
+ msi = *msi_requested;
+ }
+
+ gic = xa_load(&gc->irq_contexts, msi);
+ if (gic) {
+ refcount_inc(&gic->refcount);
+ if (use_msi_bitmap) {
+ gic->bitmap_refs++;
+ set_bit(msi, gc->msi_bitmap);
+ }
+ goto out;
+ }
+
+ irq = pci_irq_vector(dev, msi);
+ if (irq == -EINVAL) {
+ irq_map = pci_msix_alloc_irq_at(dev, msi, NULL);
+ if (!irq_map.virq) {
+ err = irq_map.index;
+ dev_err(gc->dev,
+ "Failed to alloc irq_map msi %d err %d\n",
+ msi, err);
+ gic = NULL;
+ goto out;
+ }
+ irq = irq_map.virq;
+ msi = irq_map.index;
+ }
+
+ gic = kzalloc(sizeof(*gic), GFP_KERNEL);
+ if (!gic) {
+ if (irq_map.virq)
+ pci_msix_free_irq(dev, irq_map);
+ goto out;
+ }
+
+ gic->handler = mana_gd_process_eq_events;
+ gic->msi = msi;
+ gic->irq = irq;
+ INIT_LIST_HEAD(&gic->eq_list);
+ spin_lock_init(&gic->lock);
+
+ if (!gic->msi)
+ snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_hwc@pci:%s",
+ pci_name(dev));
+ else
+ snprintf(gic->name, MANA_IRQ_NAME_SZ, "mana_msi%d@pci:%s",
+ gic->msi, pci_name(dev));
+
+ err = request_irq(irq, mana_gd_intr, 0, gic->name, gic);
+ if (err) {
+ dev_err(gc->dev, "Failed to request irq %d %s\n",
+ irq, gic->name);
+ kfree(gic);
+ gic = NULL;
+ if (irq_map.virq)
+ pci_msix_free_irq(dev, irq_map);
+ goto out;
+ }
+
+ gic->dyn_msix = !!irq_map.virq;
+ refcount_set(&gic->refcount, 1);
+ gic->bitmap_refs = use_msi_bitmap ? 1 : 0;
+
+ err = xa_err(xa_store(&gc->irq_contexts, msi, gic, GFP_KERNEL));
+ if (err) {
+ dev_err(gc->dev, "Failed to store irq context for msi %d: %d\n",
+ msi, err);
+ free_irq(irq, gic);
+ kfree(gic);
+ gic = NULL;
+ if (irq_map.virq)
+ pci_msix_free_irq(dev, irq_map);
+ goto out;
+ }
+
+ if (use_msi_bitmap)
+ set_bit(msi, gc->msi_bitmap);
+
+out:
+ mutex_unlock(&gc->gic_mutex);
+ return gic;
+}
+EXPORT_SYMBOL_NS(mana_gd_get_gic, "NET_MANA");
+
int mana_gd_alloc_res_map(u32 res_avail, struct gdma_resource *r)
{
r->map = bitmap_zalloc(res_avail, GFP_KERNEL);
@@ -2076,6 +2234,7 @@ static int mana_gd_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
goto release_region;
mutex_init(&gc->eq_test_event_mutex);
+ mutex_init(&gc->gic_mutex);
pci_set_drvdata(pdev, gc);
gc->bar0_pa = pci_resource_start(pdev, 0);
gc->bar0_size = pci_resource_len(pdev, 0);
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index ecd9949df213..4614a6a7271b 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -388,6 +388,11 @@ struct gdma_irq_context {
spinlock_t lock;
struct list_head eq_list;
char name[MANA_IRQ_NAME_SZ];
+ unsigned int msi;
+ unsigned int irq;
+ refcount_t refcount;
+ unsigned int bitmap_refs;
+ bool dyn_msix;
};
enum gdma_context_flags {
@@ -449,6 +454,9 @@ struct gdma_context {
unsigned long flags;
+ /* Protect access to GIC context */
+ struct mutex gic_mutex;
+
/* Indicate if this device is sharing MSI for EQs on MANA */
bool msi_sharing;
@@ -1021,6 +1029,9 @@ int mana_gd_resume(struct pci_dev *pdev);
bool mana_need_log(struct gdma_context *gc, int err);
+struct gdma_irq_context *mana_gd_get_gic(struct gdma_context *gc, bool use_msi_bitmap,
+ int *msi_requested);
+void mana_gd_put_gic(struct gdma_context *gc, bool use_msi_bitmap, int msi);
int mana_gd_query_device_cfg(struct gdma_context *gc, u32 proto_major_ver,
u32 proto_minor_ver, u32 proto_micro_ver,
u16 *max_num_vports, u8 *bm_hostmode);
--
2.43.0
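A note on the reference rules in the patch above: each GIC carries two counters, one for every user of the vector (refcount) and one for users that obtained it through the bitmap (bitmap_refs). The following is a minimal user-space sketch of that bookkeeping only, not the driver code. Everything prefixed sketch_ or SKETCH_ is hypothetical; a fixed array stands in for the xarray and the MSI bitmap, and no locking, IRQ request, or dynamic MSI-X allocation is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_MAX_MSI 8	/* hypothetical table size, not from the patch */

/* Simplified stand-in for struct gdma_irq_context. */
struct sketch_gic {
	unsigned int msi;
	unsigned int refcount;		/* every user of this vector */
	unsigned int bitmap_refs;	/* users that came through the bitmap */
};

/* Simplified stand-in for irq_contexts and msi_bitmap in gdma_context. */
struct sketch_ctx {
	struct sketch_gic *table[SKETCH_MAX_MSI];
	bool bitmap[SKETCH_MAX_MSI];
};

/* Pick a free index from the bitmap or accept the caller's index, then
 * either take another reference or create the context with one reference.
 */
static struct sketch_gic *sketch_get_gic(struct sketch_ctx *c,
					 bool use_bitmap, int *msi_requested)
{
	struct sketch_gic *gic;
	int msi;

	if (use_bitmap) {
		for (msi = 0; msi < SKETCH_MAX_MSI; msi++)
			if (!c->bitmap[msi])
				break;
		if (msi == SKETCH_MAX_MSI)
			return NULL;	/* no free vector */
		*msi_requested = msi;
	} else {
		msi = *msi_requested;
		if (msi < 0 || msi >= SKETCH_MAX_MSI)
			return NULL;
	}

	gic = c->table[msi];
	if (!gic) {
		gic = calloc(1, sizeof(*gic));
		if (!gic)
			return NULL;
		gic->msi = (unsigned int)msi;
		c->table[msi] = gic;
	}

	gic->refcount++;
	if (use_bitmap) {
		gic->bitmap_refs++;
		c->bitmap[msi] = true;
	}
	return gic;
}

/* The bitmap slot is released when the last bitmap user leaves; the
 * context itself is freed when the last user of any kind leaves.
 */
static void sketch_put_gic(struct sketch_ctx *c, bool use_bitmap, int msi)
{
	struct sketch_gic *gic;

	if (msi < 0 || msi >= SKETCH_MAX_MSI)
		return;
	gic = c->table[msi];
	if (!gic)
		return;

	assert(gic->refcount > 0);
	assert(!use_bitmap || gic->bitmap_refs > 0);

	if (use_bitmap && --gic->bitmap_refs == 0)
		c->bitmap[msi] = false;

	if (--gic->refcount == 0) {
		c->table[msi] = NULL;
		free(gic);
	}
}
```

In this model a vector taken directly by index and later handed out again through the bitmap keeps its context alive until both users have called put, while the bitmap slot becomes free again as soon as the bitmap user leaves.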
* [PATCH net-next v5 2/6] net: mana: Query device capabilities and configure MSI-X sharing for EQs
From: Long Li @ 2026-03-23 19:59 UTC
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260323195952.1767304-1-longli@microsoft.com>
When querying the device, adjust the max number of queues to allow
dedicated MSI-X vectors for each vPort. The number of queues per vPort
is clamped to no less than MANA_DEF_NUM_QUEUES. MSI-X sharing among
vPorts is disabled by default and is only enabled when there are not
enough MSI-X vectors for dedicated allocation.
Rename mana_query_device_cfg() to mana_gd_query_device_cfg() as it is
used at GDMA device probe time for querying device capabilities.
Signed-off-by: Long Li <longli@microsoft.com>
---
Changes in v4:
- Use MANA_DEF_NUM_QUEUES instead of hardcoded 16 for max_num_queues
clamping
Changes in v2:
- Fixed misleading comment for max_num_queues vs max_num_queues_vport
in gdma.h
---
.../net/ethernet/microsoft/mana/gdma_main.c | 66 ++++++++++++++++---
drivers/net/ethernet/microsoft/mana/mana_en.c | 36 +++++-----
include/net/mana/gdma.h | 13 +++-
3 files changed, 91 insertions(+), 24 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
index 2ba1fa3336f9..ae18b4054a02 100644
--- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
+++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
@@ -124,6 +124,9 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
struct gdma_context *gc = pci_get_drvdata(pdev);
struct gdma_query_max_resources_resp resp = {};
struct gdma_general_req req = {};
+ unsigned int max_num_queues;
+ u8 bm_hostmode;
+ u16 num_ports;
int err;
mana_gd_init_req_hdr(&req.hdr, GDMA_QUERY_MAX_RESOURCES,
@@ -169,6 +172,40 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
if (gc->max_num_queues > gc->num_msix_usable - 1)
gc->max_num_queues = gc->num_msix_usable - 1;
+ err = mana_gd_query_device_cfg(gc, MANA_MAJOR_VERSION, MANA_MINOR_VERSION,
+ MANA_MICRO_VERSION, &num_ports, &bm_hostmode);
+ if (err)
+ return err;
+
+ if (!num_ports)
+ return -EINVAL;
+
+ /*
+ * Adjust gc->max_num_queues returned from the SOC to allow dedicated
+ * MSIx for each vPort. Clamp to no less than MANA_DEF_NUM_QUEUES.
+ */
+ max_num_queues = (gc->num_msix_usable - 1) / num_ports;
+ max_num_queues = roundup_pow_of_two(max(max_num_queues, 1U));
+ if (max_num_queues < MANA_DEF_NUM_QUEUES)
+ max_num_queues = MANA_DEF_NUM_QUEUES;
+
+ /*
+ * Use dedicated MSIx for EQs whenever possible, use MSIx sharing for
+ * Ethernet EQs when (max_num_queues * num_ports > num_msix_usable - 1)
+ */
+ max_num_queues = min(gc->max_num_queues, max_num_queues);
+ if (max_num_queues * num_ports > gc->num_msix_usable - 1)
+ gc->msi_sharing = true;
+
+ /* If MSI is shared, use max allowed value */
+ if (gc->msi_sharing)
+ gc->max_num_queues_vport = min(gc->num_msix_usable - 1, gc->max_num_queues);
+ else
+ gc->max_num_queues_vport = max_num_queues;
+
+ dev_info(gc->dev, "MSI sharing mode %d max queues %d\n",
+ gc->msi_sharing, gc->max_num_queues);
+
return 0;
}
@@ -1831,6 +1868,7 @@ static int mana_gd_setup_hwc_irqs(struct pci_dev *pdev)
/* Need 1 interrupt for HWC */
max_irqs = min(num_online_cpus(), MANA_MAX_NUM_QUEUES) + 1;
min_irqs = 2;
+ gc->msi_sharing = true;
}
nvec = pci_alloc_irq_vectors(pdev, min_irqs, max_irqs, PCI_IRQ_MSIX);
@@ -1909,6 +1947,8 @@ static void mana_gd_remove_irqs(struct pci_dev *pdev)
pci_free_irq_vectors(pdev);
+ bitmap_free(gc->msi_bitmap);
+ gc->msi_bitmap = NULL;
gc->max_num_msix = 0;
gc->num_msix_usable = 0;
}
@@ -1943,20 +1983,30 @@ static int mana_gd_setup(struct pci_dev *pdev)
if (err)
goto destroy_hwc;
- err = mana_gd_query_max_resources(pdev);
+ err = mana_gd_detect_devices(pdev);
if (err)
goto destroy_hwc;
- err = mana_gd_setup_remaining_irqs(pdev);
- if (err) {
- dev_err(gc->dev, "Failed to setup remaining IRQs: %d", err);
- goto destroy_hwc;
- }
-
- err = mana_gd_detect_devices(pdev);
+ err = mana_gd_query_max_resources(pdev);
if (err)
goto destroy_hwc;
+ if (!gc->msi_sharing) {
+ gc->msi_bitmap = bitmap_zalloc(gc->num_msix_usable, GFP_KERNEL);
+ if (!gc->msi_bitmap) {
+ err = -ENOMEM;
+ goto destroy_hwc;
+ }
+ /* Set bit for HWC */
+ set_bit(0, gc->msi_bitmap);
+ } else {
+ err = mana_gd_setup_remaining_irqs(pdev);
+ if (err) {
+ dev_err(gc->dev, "Failed to setup remaining IRQs: %d", err);
+ goto destroy_hwc;
+ }
+ }
+
dev_dbg(&pdev->dev, "mana gdma setup successful\n");
return 0;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 178c583d74b4..004d48bba8aa 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1000,10 +1000,9 @@ static int mana_init_port_context(struct mana_port_context *apc)
return !apc->rxqs ? -ENOMEM : 0;
}
-static int mana_send_request(struct mana_context *ac, void *in_buf,
- u32 in_len, void *out_buf, u32 out_len)
+static int gdma_mana_send_request(struct gdma_context *gc, void *in_buf,
+ u32 in_len, void *out_buf, u32 out_len)
{
- struct gdma_context *gc = ac->gdma_dev->gdma_context;
struct gdma_resp_hdr *resp = out_buf;
struct gdma_req_hdr *req = in_buf;
struct device *dev = gc->dev;
@@ -1037,6 +1036,14 @@ static int mana_send_request(struct mana_context *ac, void *in_buf,
return 0;
}
+static int mana_send_request(struct mana_context *ac, void *in_buf,
+ u32 in_len, void *out_buf, u32 out_len)
+{
+ struct gdma_context *gc = ac->gdma_dev->gdma_context;
+
+ return gdma_mana_send_request(gc, in_buf, in_len, out_buf, out_len);
+}
+
static int mana_verify_resp_hdr(const struct gdma_resp_hdr *resp_hdr,
const enum mana_command_code expected_code,
const u32 min_size)
@@ -1170,11 +1177,10 @@ static void mana_pf_deregister_filter(struct mana_port_context *apc)
err, resp.hdr.status);
}
-static int mana_query_device_cfg(struct mana_context *ac, u32 proto_major_ver,
- u32 proto_minor_ver, u32 proto_micro_ver,
- u16 *max_num_vports, u8 *bm_hostmode)
+int mana_gd_query_device_cfg(struct gdma_context *gc, u32 proto_major_ver,
+ u32 proto_minor_ver, u32 proto_micro_ver,
+ u16 *max_num_vports, u8 *bm_hostmode)
{
- struct gdma_context *gc = ac->gdma_dev->gdma_context;
struct mana_query_device_cfg_resp resp = {};
struct mana_query_device_cfg_req req = {};
struct device *dev = gc->dev;
@@ -1189,7 +1195,7 @@ static int mana_query_device_cfg(struct mana_context *ac, u32 proto_major_ver,
req.proto_minor_ver = proto_minor_ver;
req.proto_micro_ver = proto_micro_ver;
- err = mana_send_request(ac, &req, sizeof(req), &resp, sizeof(resp));
+ err = gdma_mana_send_request(gc, &req, sizeof(req), &resp, sizeof(resp));
if (err) {
dev_err(dev, "Failed to query config: %d", err);
return err;
@@ -1217,8 +1223,6 @@ static int mana_query_device_cfg(struct mana_context *ac, u32 proto_major_ver,
else
*bm_hostmode = 0;
- debugfs_create_u16("adapter-MTU", 0400, gc->mana_pci_debugfs, &gc->adapter_mtu);
-
return 0;
}
@@ -3373,7 +3377,7 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
int err;
ndev = alloc_etherdev_mq(sizeof(struct mana_port_context),
- gc->max_num_queues);
+ gc->max_num_queues_vport);
if (!ndev)
return -ENOMEM;
@@ -3382,9 +3386,9 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
apc = netdev_priv(ndev);
apc->ac = ac;
apc->ndev = ndev;
- apc->max_queues = gc->max_num_queues;
+ apc->max_queues = gc->max_num_queues_vport;
/* Use MANA_DEF_NUM_QUEUES as default, still honoring the HW limit */
- apc->num_queues = min(gc->max_num_queues, MANA_DEF_NUM_QUEUES);
+ apc->num_queues = min(gc->max_num_queues_vport, MANA_DEF_NUM_QUEUES);
apc->tx_queue_size = DEF_TX_BUFFERS_PER_QUEUE;
apc->rx_queue_size = DEF_RX_BUFFERS_PER_QUEUE;
apc->port_handle = INVALID_MANA_HANDLE;
@@ -3644,13 +3648,15 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
gd->driver_data = ac;
}
- err = mana_query_device_cfg(ac, MANA_MAJOR_VERSION, MANA_MINOR_VERSION,
- MANA_MICRO_VERSION, &num_ports, &bm_hostmode);
+ err = mana_gd_query_device_cfg(gc, MANA_MAJOR_VERSION, MANA_MINOR_VERSION,
+ MANA_MICRO_VERSION, &num_ports, &bm_hostmode);
if (err)
goto out;
ac->bm_hostmode = bm_hostmode;
+ debugfs_create_u16("adapter-MTU", 0400, gc->mana_pci_debugfs, &gc->adapter_mtu);
+
if (!resuming) {
ac->num_ports = num_ports;
diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
index 7fe3a1b61b2d..ecd9949df213 100644
--- a/include/net/mana/gdma.h
+++ b/include/net/mana/gdma.h
@@ -399,8 +399,10 @@ struct gdma_context {
struct device *dev;
struct dentry *mana_pci_debugfs;
- /* Per-vPort max number of queues */
+ /* Hardware max number of queues */
unsigned int max_num_queues;
+ /* Per-vPort max number of queues */
+ unsigned int max_num_queues_vport;
unsigned int max_num_msix;
unsigned int num_msix_usable;
struct xarray irq_contexts;
@@ -446,6 +448,12 @@ struct gdma_context {
struct workqueue_struct *service_wq;
unsigned long flags;
+
+ /* Indicate if this device is sharing MSI for EQs on MANA */
+ bool msi_sharing;
+
+ /* Bitmap tracks where MSI is allocated when it is not shared for EQs */
+ unsigned long *msi_bitmap;
};
static inline bool mana_gd_is_mana(struct gdma_dev *gd)
@@ -1013,4 +1021,7 @@ int mana_gd_resume(struct pci_dev *pdev);
bool mana_need_log(struct gdma_context *gc, int err);
+int mana_gd_query_device_cfg(struct gdma_context *gc, u32 proto_major_ver,
+ u32 proto_minor_ver, u32 proto_micro_ver,
+ u16 *max_num_vports, u8 *bm_hostmode);
#endif /* _GDMA_H */
--
2.43.0
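The queue-count arithmetic added to mana_gd_query_max_resources() in the patch above can be tried out in user space. The following is a minimal sketch under stated assumptions, not the driver code: the sketch_ names are hypothetical, sketch_roundup_pow2() is a simple stand-in for the kernel's roundup_pow_of_two(), and the sharing_forced parameter models the case where gc->msi_sharing was already set before the query (the patch sets it in mana_gd_setup_hwc_irqs()).

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_DEF_NUM_QUEUES 16U	/* MANA_DEF_NUM_QUEUES in this thread */

struct sketch_queue_plan {
	unsigned int max_queues_vport;
	bool msi_sharing;
};

/* Smallest power of two >= n, for small n >= 1. */
static unsigned int sketch_roundup_pow2(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}

/* hw_max_queues plays the role of gc->max_num_queues after it has been
 * limited by the device; num_msix_usable includes the HWC vector.
 */
static struct sketch_queue_plan
sketch_plan_queues(unsigned int hw_max_queues, unsigned int num_msix_usable,
		   unsigned int num_ports, bool sharing_forced)
{
	struct sketch_queue_plan plan = { 0, sharing_forced };
	unsigned int eq_vectors, q;

	assert(num_ports > 0 && num_msix_usable > 1);
	eq_vectors = num_msix_usable - 1;	/* one vector is kept for HWC */

	/* Even split across vPorts, rounded up to a power of two. */
	q = eq_vectors / num_ports;
	q = sketch_roundup_pow2(q ? q : 1);

	/* Never plan for fewer than the default, never more than the HW max. */
	if (q < SKETCH_DEF_NUM_QUEUES)
		q = SKETCH_DEF_NUM_QUEUES;
	if (q > hw_max_queues)
		q = hw_max_queues;

	/* Dedicated vectors only work if every vPort's queues fit. */
	if (q * num_ports > eq_vectors)
		plan.msi_sharing = true;

	if (plan.msi_sharing)
		plan.max_queues_vport = hw_max_queues < eq_vectors ?
					hw_max_queues : eq_vectors;
	else
		plan.max_queues_vport = q;

	return plan;
}
```

One consequence visible in this model: because the per-vPort share is rounded up, a port count that does not divide the vector count evenly (for example 3 ports on 64 EQ vectors) ends up in sharing mode even though a smaller per-port count would have fit.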
* [PATCH net-next v5 1/6] net: mana: Create separate EQs for each vPort
From: Long Li @ 2026-03-23 19:59 UTC
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260323195952.1767304-1-longli@microsoft.com>
To prepare for assigning vPorts to dedicated MSI-X vectors, remove EQ
sharing among the vPorts and create dedicated EQs for each vPort.
Move the EQ definition from struct mana_context to struct mana_port_context
and update related support functions. Export mana_create_eq() and
mana_destroy_eq() for use by the MANA RDMA driver.
Signed-off-by: Long Li <longli@microsoft.com>
---
Changes in v3:
- Added NULL check for mpc->eqs in mana_ib_create_qp_rss() to prevent
kernel crash when RSS QP is created before EQs are allocated
---
drivers/infiniband/hw/mana/main.c | 14 ++-
drivers/infiniband/hw/mana/qp.c | 16 ++-
drivers/net/ethernet/microsoft/mana/mana_en.c | 109 ++++++++++--------
include/net/mana/mana.h | 7 +-
4 files changed, 94 insertions(+), 52 deletions(-)
diff --git a/drivers/infiniband/hw/mana/main.c b/drivers/infiniband/hw/mana/main.c
index 8d99cd00f002..d51dd0ee85f4 100644
--- a/drivers/infiniband/hw/mana/main.c
+++ b/drivers/infiniband/hw/mana/main.c
@@ -20,8 +20,10 @@ void mana_ib_uncfg_vport(struct mana_ib_dev *dev, struct mana_ib_pd *pd,
pd->vport_use_count--;
WARN_ON(pd->vport_use_count < 0);
- if (!pd->vport_use_count)
+ if (!pd->vport_use_count) {
+ mana_destroy_eq(mpc);
mana_uncfg_vport(mpc);
+ }
mutex_unlock(&pd->vport_mutex);
}
@@ -55,15 +57,21 @@ int mana_ib_cfg_vport(struct mana_ib_dev *dev, u32 port, struct mana_ib_pd *pd,
return err;
}
- mutex_unlock(&pd->vport_mutex);
pd->tx_shortform_allowed = mpc->tx_shortform_allowed;
pd->tx_vp_offset = mpc->tx_vp_offset;
+ err = mana_create_eq(mpc);
+ if (err) {
+ mana_uncfg_vport(mpc);
+ pd->vport_use_count--;
+ }
+
+ mutex_unlock(&pd->vport_mutex);
ibdev_dbg(&dev->ib_dev, "vport handle %llx pdid %x doorbell_id %x\n",
mpc->port_handle, pd->pdn, doorbell_id);
- return 0;
+ return err;
}
int mana_ib_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
diff --git a/drivers/infiniband/hw/mana/qp.c b/drivers/infiniband/hw/mana/qp.c
index 82f84f7ad37a..80cf4ade4b75 100644
--- a/drivers/infiniband/hw/mana/qp.c
+++ b/drivers/infiniband/hw/mana/qp.c
@@ -188,7 +188,15 @@ static int mana_ib_create_qp_rss(struct ib_qp *ibqp, struct ib_pd *pd,
cq_spec.gdma_region = cq->queue.gdma_region;
cq_spec.queue_size = cq->cqe * COMP_ENTRY_SIZE;
cq_spec.modr_ctx_id = 0;
- eq = &mpc->ac->eqs[cq->comp_vector];
+ /* EQs are created when a raw QP configures the vport.
+ * A raw QP must be created before creating rwq_ind_tbl.
+ */
+ if (!mpc->eqs) {
+ ret = -EINVAL;
+ i--;
+ goto fail;
+ }
+ eq = &mpc->eqs[cq->comp_vector % mpc->num_queues];
cq_spec.attached_eq = eq->eq->id;
ret = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_RQ,
@@ -340,7 +348,11 @@ static int mana_ib_create_qp_raw(struct ib_qp *ibqp, struct ib_pd *ibpd,
cq_spec.queue_size = send_cq->cqe * COMP_ENTRY_SIZE;
cq_spec.modr_ctx_id = 0;
eq_vec = send_cq->comp_vector;
- eq = &mpc->ac->eqs[eq_vec];
+ if (!mpc->eqs) {
+ err = -EINVAL;
+ goto err_destroy_queue;
+ }
+ eq = &mpc->eqs[eq_vec % mpc->num_queues];
cq_spec.attached_eq = eq->eq->id;
err = mana_create_wq_obj(mpc, mpc->port_handle, GDMA_SQ, &wq_spec,
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index b39e8b920791..178c583d74b4 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -1596,78 +1596,82 @@ void mana_destroy_wq_obj(struct mana_port_context *apc, u32 wq_type,
}
EXPORT_SYMBOL_NS(mana_destroy_wq_obj, "NET_MANA");
-static void mana_destroy_eq(struct mana_context *ac)
+void mana_destroy_eq(struct mana_port_context *apc)
{
+ struct mana_context *ac = apc->ac;
struct gdma_context *gc = ac->gdma_dev->gdma_context;
struct gdma_queue *eq;
int i;
- if (!ac->eqs)
+ if (!apc->eqs)
return;
- debugfs_remove_recursive(ac->mana_eqs_debugfs);
- ac->mana_eqs_debugfs = NULL;
+ debugfs_remove_recursive(apc->mana_eqs_debugfs);
+ apc->mana_eqs_debugfs = NULL;
- for (i = 0; i < gc->max_num_queues; i++) {
- eq = ac->eqs[i].eq;
+ for (i = 0; i < apc->num_queues; i++) {
+ eq = apc->eqs[i].eq;
if (!eq)
continue;
mana_gd_destroy_queue(gc, eq);
}
- kfree(ac->eqs);
- ac->eqs = NULL;
+ kfree(apc->eqs);
+ apc->eqs = NULL;
}
+EXPORT_SYMBOL_NS(mana_destroy_eq, "NET_MANA");
-static void mana_create_eq_debugfs(struct mana_context *ac, int i)
+static void mana_create_eq_debugfs(struct mana_port_context *apc, int i)
{
- struct mana_eq eq = ac->eqs[i];
+ struct mana_eq eq = apc->eqs[i];
char eqnum[32];
sprintf(eqnum, "eq%d", i);
- eq.mana_eq_debugfs = debugfs_create_dir(eqnum, ac->mana_eqs_debugfs);
+ eq.mana_eq_debugfs = debugfs_create_dir(eqnum, apc->mana_eqs_debugfs);
debugfs_create_u32("head", 0400, eq.mana_eq_debugfs, &eq.eq->head);
debugfs_create_u32("tail", 0400, eq.mana_eq_debugfs, &eq.eq->tail);
debugfs_create_file("eq_dump", 0400, eq.mana_eq_debugfs, eq.eq, &mana_dbg_q_fops);
}
-static int mana_create_eq(struct mana_context *ac)
+int mana_create_eq(struct mana_port_context *apc)
{
- struct gdma_dev *gd = ac->gdma_dev;
+ struct gdma_dev *gd = apc->ac->gdma_dev;
struct gdma_context *gc = gd->gdma_context;
struct gdma_queue_spec spec = {};
int err;
int i;
- ac->eqs = kzalloc_objs(struct mana_eq, gc->max_num_queues);
- if (!ac->eqs)
+ WARN_ON(apc->eqs);
+ apc->eqs = kzalloc_objs(struct mana_eq, apc->num_queues);
+ if (!apc->eqs)
return -ENOMEM;
spec.type = GDMA_EQ;
spec.monitor_avl_buf = false;
spec.queue_size = EQ_SIZE;
spec.eq.callback = NULL;
- spec.eq.context = ac->eqs;
+ spec.eq.context = apc->eqs;
spec.eq.log2_throttle_limit = LOG2_EQ_THROTTLE;
- ac->mana_eqs_debugfs = debugfs_create_dir("EQs", gc->mana_pci_debugfs);
+ apc->mana_eqs_debugfs = debugfs_create_dir("EQs", apc->mana_port_debugfs);
- for (i = 0; i < gc->max_num_queues; i++) {
+ for (i = 0; i < apc->num_queues; i++) {
spec.eq.msix_index = (i + 1) % gc->num_msix_usable;
- err = mana_gd_create_mana_eq(gd, &spec, &ac->eqs[i].eq);
+ err = mana_gd_create_mana_eq(gd, &spec, &apc->eqs[i].eq);
if (err) {
dev_err(gc->dev, "Failed to create EQ %d : %d\n", i, err);
goto out;
}
- mana_create_eq_debugfs(ac, i);
+ mana_create_eq_debugfs(apc, i);
}
return 0;
out:
- mana_destroy_eq(ac);
+ mana_destroy_eq(apc);
return err;
}
+EXPORT_SYMBOL_NS(mana_create_eq, "NET_MANA");
static int mana_fence_rq(struct mana_port_context *apc, struct mana_rxq *rxq)
{
@@ -2421,7 +2425,7 @@ static int mana_create_txq(struct mana_port_context *apc,
spec.monitor_avl_buf = false;
spec.queue_size = cq_size;
spec.cq.callback = mana_schedule_napi;
- spec.cq.parent_eq = ac->eqs[i].eq;
+ spec.cq.parent_eq = apc->eqs[i].eq;
spec.cq.context = cq;
err = mana_gd_create_mana_wq_cq(gd, &spec, &cq->gdma_cq);
if (err)
@@ -2814,13 +2818,12 @@ static void mana_create_rxq_debugfs(struct mana_port_context *apc, int idx)
static int mana_add_rx_queues(struct mana_port_context *apc,
struct net_device *ndev)
{
- struct mana_context *ac = apc->ac;
struct mana_rxq *rxq;
int err = 0;
int i;
for (i = 0; i < apc->num_queues; i++) {
- rxq = mana_create_rxq(apc, i, &ac->eqs[i], ndev);
+ rxq = mana_create_rxq(apc, i, &apc->eqs[i], ndev);
if (!rxq) {
err = -ENOMEM;
netdev_err(ndev, "Failed to create rxq %d : %d\n", i, err);
@@ -2839,9 +2842,8 @@ static int mana_add_rx_queues(struct mana_port_context *apc,
return err;
}
-static void mana_destroy_vport(struct mana_port_context *apc)
+static void mana_destroy_rxqs(struct mana_port_context *apc)
{
- struct gdma_dev *gd = apc->ac->gdma_dev;
struct mana_rxq *rxq;
u32 rxq_idx;
@@ -2853,8 +2855,12 @@ static void mana_destroy_vport(struct mana_port_context *apc)
mana_destroy_rxq(apc, rxq, true);
apc->rxqs[rxq_idx] = NULL;
}
+}
+
+static void mana_destroy_vport(struct mana_port_context *apc)
+{
+ struct gdma_dev *gd = apc->ac->gdma_dev;
- mana_destroy_txq(apc);
mana_uncfg_vport(apc);
if (gd->gdma_context->is_pf && !apc->ac->bm_hostmode)
@@ -2875,11 +2881,7 @@ static int mana_create_vport(struct mana_port_context *apc,
return err;
}
- err = mana_cfg_vport(apc, gd->pdid, gd->doorbell);
- if (err)
- return err;
-
- return mana_create_txq(apc, net);
+ return mana_cfg_vport(apc, gd->pdid, gd->doorbell);
}
static int mana_rss_table_alloc(struct mana_port_context *apc)
@@ -3156,21 +3158,36 @@ int mana_alloc_queues(struct net_device *ndev)
err = mana_create_vport(apc, ndev);
if (err) {
- netdev_err(ndev, "Failed to create vPort %u : %d\n", apc->port_idx, err);
+ netdev_err(ndev, "Failed to create vPort %u : %d\n",
+ apc->port_idx, err);
return err;
}
+ err = mana_create_eq(apc);
+ if (err) {
+ netdev_err(ndev, "Failed to create EQ on vPort %u: %d\n",
+ apc->port_idx, err);
+ goto destroy_vport;
+ }
+
+ err = mana_create_txq(apc, ndev);
+ if (err) {
+ netdev_err(ndev, "Failed to create TXQ on vPort %u: %d\n",
+ apc->port_idx, err);
+ goto destroy_eq;
+ }
+
err = netif_set_real_num_tx_queues(ndev, apc->num_queues);
if (err) {
netdev_err(ndev,
"netif_set_real_num_tx_queues () failed for ndev with num_queues %u : %d\n",
apc->num_queues, err);
- goto destroy_vport;
+ goto destroy_txq;
}
err = mana_add_rx_queues(apc, ndev);
if (err)
- goto destroy_vport;
+ goto destroy_rxq;
apc->rss_state = apc->num_queues > 1 ? TRI_STATE_TRUE : TRI_STATE_FALSE;
@@ -3179,7 +3196,7 @@ int mana_alloc_queues(struct net_device *ndev)
netdev_err(ndev,
"netif_set_real_num_rx_queues () failed for ndev with num_queues %u : %d\n",
apc->num_queues, err);
- goto destroy_vport;
+ goto destroy_rxq;
}
mana_rss_table_init(apc);
@@ -3187,19 +3204,25 @@ int mana_alloc_queues(struct net_device *ndev)
err = mana_config_rss(apc, TRI_STATE_TRUE, true, true);
if (err) {
netdev_err(ndev, "Failed to configure RSS table: %d\n", err);
- goto destroy_vport;
+ goto destroy_rxq;
}
if (gd->gdma_context->is_pf && !apc->ac->bm_hostmode) {
err = mana_pf_register_filter(apc);
if (err)
- goto destroy_vport;
+ goto destroy_rxq;
}
mana_chn_setxdp(apc, mana_xdp_get(apc));
return 0;
+destroy_rxq:
+ mana_destroy_rxqs(apc);
+destroy_txq:
+ mana_destroy_txq(apc);
+destroy_eq:
+ mana_destroy_eq(apc);
destroy_vport:
mana_destroy_vport(apc);
return err;
@@ -3302,6 +3325,9 @@ static int mana_dealloc_queues(struct net_device *ndev)
netdev_err(ndev, "Failed to disable vPort: %d\n", err);
/* Even in err case, still need to cleanup the vPort */
+ mana_destroy_rxqs(apc);
+ mana_destroy_txq(apc);
+ mana_destroy_eq(apc);
mana_destroy_vport(apc);
return 0;
@@ -3618,12 +3644,6 @@ int mana_probe(struct gdma_dev *gd, bool resuming)
gd->driver_data = ac;
}
- err = mana_create_eq(ac);
- if (err) {
- dev_err(dev, "Failed to create EQs: %d\n", err);
- goto out;
- }
-
err = mana_query_device_cfg(ac, MANA_MAJOR_VERSION, MANA_MINOR_VERSION,
MANA_MICRO_VERSION, &num_ports, &bm_hostmode);
if (err)
@@ -3762,7 +3782,6 @@ void mana_remove(struct gdma_dev *gd, bool suspending)
free_netdev(ndev);
}
- mana_destroy_eq(ac);
out:
if (ac->per_port_queue_reset_wq) {
destroy_workqueue(ac->per_port_queue_reset_wq);
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 96d21cbbdee2..204c2b612a62 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -480,8 +480,6 @@ struct mana_context {
u8 bm_hostmode;
struct mana_ethtool_hc_stats hc_stats;
- struct mana_eq *eqs;
- struct dentry *mana_eqs_debugfs;
struct workqueue_struct *per_port_queue_reset_wq;
/* Workqueue for querying hardware stats */
struct delayed_work gf_stats_work;
@@ -501,6 +499,9 @@ struct mana_port_context {
u8 mac_addr[ETH_ALEN];
+ struct mana_eq *eqs;
+ struct dentry *mana_eqs_debugfs;
+
enum TRI_STATE rss_state;
mana_handle_t default_rxobj;
@@ -1033,6 +1034,8 @@ void mana_destroy_wq_obj(struct mana_port_context *apc, u32 wq_type,
int mana_cfg_vport(struct mana_port_context *apc, u32 protection_dom_id,
u32 doorbell_pg_id);
void mana_uncfg_vport(struct mana_port_context *apc);
+int mana_create_eq(struct mana_port_context *apc);
+void mana_destroy_eq(struct mana_port_context *apc);
struct net_device *mana_get_primary_netdev(struct mana_context *ac,
u32 port_index,
--
2.43.0
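The reworked error handling in mana_alloc_queues() above follows the usual staged-setup pattern: resources are created in a fixed order and each failure jumps to a label that tears down only what already exists, in reverse order. The following is a minimal user-space sketch of that control flow, not the driver code. All sketch_ names are hypothetical, and a trace string replaces the real create and destroy calls so the order can be inspected.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical trace buffer recording what was created and destroyed. */
static char sketch_trace[64];

static void sketch_note(const char *s)
{
	assert(strlen(sketch_trace) + strlen(s) < sizeof(sketch_trace));
	strcat(sketch_trace, s);
}

/* Returns 0 on success and records the tag; fails if stage == fail_at. */
static int sketch_stage(int stage, int fail_at, const char *tag)
{
	if (stage == fail_at)
		return -1;
	sketch_note(tag);
	return 0;
}

/* Stages: 1 vPort, 2 EQs, 3 TX queues, 4 set TX queue count, 5 RX queues.
 * fail_at == 0 means no stage fails.
 */
static int sketch_alloc_queues(int fail_at)
{
	int err;

	sketch_trace[0] = '\0';

	err = sketch_stage(1, fail_at, "+vport ");
	if (err)
		return err;

	err = sketch_stage(2, fail_at, "+eq ");
	if (err)
		goto destroy_vport;

	/* A failed TX create cleans up after itself, so skip destroy_txq. */
	err = sketch_stage(3, fail_at, "+txq ");
	if (err)
		goto destroy_eq;

	err = sketch_stage(4, fail_at, "");
	if (err)
		goto destroy_txq;

	/* RX setup may fail halfway, so its teardown runs on failure too. */
	err = sketch_stage(5, fail_at, "+rxq ");
	if (err)
		goto destroy_rxq;

	return 0;

destroy_rxq:
	sketch_note("-rxq ");
destroy_txq:
	sketch_note("-txq ");
destroy_eq:
	sketch_note("-eq ");
destroy_vport:
	sketch_note("-vport ");
	return err;
}
```

The labels fall through from the most recently created resource to the oldest one, which is the same order the patch uses in mana_dealloc_queues(): RX queues, TX queues, EQs, then the vPort.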
* [PATCH net-next v5 0/6] net: mana: Per-vPort EQ and MSI-X interrupt management
From: Long Li @ 2026-03-23 19:59 UTC
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
This series adds per-vPort Event Queue (EQ) allocation and MSI-X interrupt
management for the MANA driver. Previously, all vPorts shared a single set
of EQs. This change enables dedicated EQs per vPort with support for both
dedicated and shared MSI-X vector allocation modes.
Patch 1 moves EQ ownership from mana_context to per-vPort mana_port_context
and exports create/destroy functions for the RDMA driver.
Patch 2 adds device capability queries to determine whether MSI-X vectors
should be dedicated per-vPort or shared. When the number of available MSI-X
vectors is insufficient for dedicated allocation, the driver enables sharing
mode; otherwise it tracks dedicated vector assignment in a bitmap.
Patch 3 introduces the GIC (GDMA IRQ Context) abstraction with reference
counting, allowing multiple EQs to safely share a single MSI-X vector.
Patch 4 converts the global EQ allocation in probe/resume to use the new
GIC functions.
Patch 5 adds per-vPort GIC lifecycle management, calling get/put on each
EQ creation and destruction during vPort open/close.
Patch 6 extends the same GIC lifecycle management to the RDMA driver's EQ
allocation path.
Changes in v5:
- Rebased on net-next/main
Changes in v4:
- Rebased on net-next/main 7.0-rc4
- Patch 2: Use MANA_DEF_NUM_QUEUES instead of hardcoded 16 for
max_num_queues clamping
- Patch 3: Track dyn_msix in GIC context instead of re-checking
pci_msix_can_alloc_dyn() on each call; improved remove_irqs iteration
to skip unallocated entries
Changes in v3:
- Rebased on net-next/main
- Patch 1: Added NULL check for mpc->eqs in mana_ib_create_qp_rss() to
prevent NULL pointer dereference when RSS QP is created before a raw QP
has configured the vport and allocated EQs
Changes in v2:
- Rebased on net-next/main (adapted to kzalloc_objs/kzalloc_obj macros,
new GDMA_DRV_CAP_FLAG definitions)
- Patch 2: Fixed misleading comment for max_num_queues vs
max_num_queues_vport in gdma.h
- Patch 3: Fixed spelling typo in gdma_main.c ("difference" -> "different")
Long Li (6):
net: mana: Create separate EQs for each vPort
net: mana: Query device capabilities and configure MSI-X sharing for
EQs
net: mana: Introduce GIC context with refcounting for interrupt
management
net: mana: Use GIC functions to allocate global EQs
net: mana: Allocate interrupt context for each EQ when creating vPort
RDMA/mana_ib: Allocate interrupt contexts on EQs
drivers/infiniband/hw/mana/main.c | 47 ++-
drivers/infiniband/hw/mana/qp.c | 16 +-
.../net/ethernet/microsoft/mana/gdma_main.c | 307 +++++++++++++-----
drivers/net/ethernet/microsoft/mana/mana_en.c | 162 +++++----
include/net/mana/gdma.h | 32 +-
include/net/mana/mana.h | 7 +-
6 files changed, 416 insertions(+), 155 deletions(-)
--
2.43.0
* [PATCH net-next v2] net: mana: Set default number of queues to 16
From: Long Li @ 2026-03-23 19:49 UTC
To: Long Li, Konstantin Taranov, Jakub Kicinski, David S . Miller,
Paolo Abeni, Eric Dumazet, Andrew Lunn, Jason Gunthorpe,
Leon Romanovsky, Haiyang Zhang, K . Y . Srinivasan, Wei Liu,
Dexuan Cui
Cc: Simon Horman, netdev, linux-rdma, linux-hyperv, linux-kernel
Set the default number of queues per vPort to MANA_DEF_NUM_QUEUES (16),
as 16 queues can achieve optimal throughput for typical workloads. The
actual number of queues is lower when the hardware-reported limit is
below 16. Users can increase the number of queues up to max_queues via
ethtool if needed.
Signed-off-by: Long Li <longli@microsoft.com>
---
v2:
- Updated commit message to clarify that the actual number of queues
may be lower if it exceeds the hardware reported limit.
drivers/net/ethernet/microsoft/mana/mana_en.c | 3 ++-
include/net/mana/mana.h | 1 +
2 files changed, 3 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 49c65cc1697c..b39e8b920791 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -3357,7 +3357,8 @@ static int mana_probe_port(struct mana_context *ac, int port_idx,
apc->ac = ac;
apc->ndev = ndev;
apc->max_queues = gc->max_num_queues;
- apc->num_queues = gc->max_num_queues;
+ /* Use MANA_DEF_NUM_QUEUES as default, still honoring the HW limit */
+ apc->num_queues = min(gc->max_num_queues, MANA_DEF_NUM_QUEUES);
apc->tx_queue_size = DEF_TX_BUFFERS_PER_QUEUE;
apc->rx_queue_size = DEF_RX_BUFFERS_PER_QUEUE;
apc->port_handle = INVALID_MANA_HANDLE;
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 3336688fed5e..96d21cbbdee2 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -1007,6 +1007,7 @@ struct mana_deregister_filter_resp {
#define STATISTICS_FLAGS_TX_ERRORS_GDMA_ERROR 0x0000000004000000
#define MANA_MAX_NUM_QUEUES 64
+#define MANA_DEF_NUM_QUEUES 16
#define MANA_SHORT_VPORT_OFFSET_MAX ((1U << 8) - 1)
--
2.43.0
* RE: [EXTERNAL] Re: [PATCH rdma] RDMA/mana_ib: Disable RX steering on RSS QP destroy
From: Long Li @ 2026-03-23 18:03 UTC
To: Leon Romanovsky
Cc: Konstantin Taranov, Jakub Kicinski, David S . Miller, Paolo Abeni,
Eric Dumazet, Andrew Lunn, Jason Gunthorpe, Haiyang Zhang,
KY Srinivasan, Wei Liu, Dexuan Cui, Simon Horman,
netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
stable@vger.kernel.org
In-Reply-To: <20260322184848.GC814676@unreal>
> On Fri, Mar 20, 2026 at 05:28:42PM -0700, Long Li wrote:
> > When an RSS QP is destroyed (e.g. DPDK exit), mana_ib_destroy_qp_rss()
> > destroys the RX WQ objects but does not disable vPort RX steering in
> > firmware. This leaves stale steering configuration that still points
> > to the destroyed RX objects.
> >
> > If traffic continues to arrive (e.g. peer VM is still transmitting)
> > and the VF interface is subsequently brought up (mana_open), the
> > firmware may deliver completions using stale CQ IDs from the old RX objects.
> > These CQ IDs can be reused by the ethernet driver for new TX CQs,
> > causing RX completions to land on TX CQs:
> >
> > WARNING: mana_poll_tx_cq+0x1b8/0x220 [mana] (is_sq == false)
> > WARNING: mana_gd_process_eq_events+0x209/0x290 (cq_table lookup
> > fails)
> >
> > Fix this by disabling vPort RX steering before destroying RX WQ objects.
> > Note that mana_fence_rqs() cannot be used here because the fence
> > completion is delivered on the CQ, which is polled by user-mode (e.g.
> > DPDK) and not visible to the kernel driver.
> >
> > Refactor the disable logic into a shared mana_disable_vport_rx() in
> > mana_en, exported for use by mana_ib, replacing the duplicate code.
> > The ethernet driver's mana_dealloc_queues() is also updated to call
> > this common function.
> >
> > Fixes: 0266a177631d ("RDMA/mana_ib: Add a driver for Microsoft Azure
> > Network Adapter")
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Long Li <longli@microsoft.com>
> > ---
> > drivers/infiniband/hw/mana/qp.c | 17 ++++++++++++++++-
> > drivers/net/ethernet/microsoft/mana/mana_en.c | 11 ++++++++++-
> > include/net/mana/mana.h | 1 +
> > 3 files changed, 27 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/infiniband/hw/mana/qp.c
> > b/drivers/infiniband/hw/mana/qp.c index 80cf4ade4b75..b27084c53a14
> > 100644
> > --- a/drivers/infiniband/hw/mana/qp.c
> > +++ b/drivers/infiniband/hw/mana/qp.c
> > @@ -829,11 +829,26 @@ static int mana_ib_destroy_qp_rss(struct
> mana_ib_qp *qp,
> > struct net_device *ndev;
> > struct mana_ib_wq *wq;
> > struct ib_wq *ibwq;
> > - int i;
> > + int i, err;
> >
> > ndev = mana_ib_get_netdev(qp->ibqp.device, qp->port);
> > mpc = netdev_priv(ndev);
> >
> > + /* Disable vPort RX steering before destroying RX WQ objects.
> > + * Otherwise firmware still routes traffic to the destroyed queues,
> > + * which can cause bogus completions on reused CQ IDs when the
> > + * ethernet driver later creates new queues on mana_open().
> > + *
> > + * Unlike the ethernet teardown path, mana_fence_rqs() cannot be
> > + * used here because the fence completion CQE is delivered on the
> > + * CQ which is polled by userspace (e.g. DPDK), so there is no way
> > + * for the kernel to wait for fence completion.
> > + */
> > + err = mana_disable_vport_rx(mpc);
> > + if (err)
> > + ibdev_err(&mdev->ib_dev,
> > + "Failed to disable vPort RX: %d\n", err);
>
> mana_cfg_vport_steering() already prints in all failure scenarios.
>
> Thanks
I'm sending v2 with this message removed.
Thanks,
Long
>
> > +
> > for (i = 0; i < (1 << ind_tbl->log_ind_tbl_size); i++) {
> > ibwq = ind_tbl->ind_tbl[i];
> > wq = container_of(ibwq, struct mana_ib_wq, ibwq); diff --git
> > a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index 22444c7530a5..51719ef1c09b 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > @@ -2934,6 +2934,13 @@ static void mana_rss_table_init(struct
> mana_port_context *apc)
> > ethtool_rxfh_indir_default(i, apc->num_queues); }
> >
> > +int mana_disable_vport_rx(struct mana_port_context *apc) {
> > + return mana_cfg_vport_steering(apc, TRI_STATE_FALSE, false, false,
> > + false);
> > +}
> > +EXPORT_SYMBOL_NS(mana_disable_vport_rx, "NET_MANA");
> > +
> > int mana_config_rss(struct mana_port_context *apc, enum TRI_STATE rx,
> > bool update_hash, bool update_tab) { @@ -3339,10 +3346,12
> @@
> > static int mana_dealloc_queues(struct net_device *ndev)
> > */
> >
> > apc->rss_state = TRI_STATE_FALSE;
> > - err = mana_config_rss(apc, TRI_STATE_FALSE, false, false);
> > + err = mana_disable_vport_rx(apc);
> > if (err && mana_en_need_log(apc, err))
> > netdev_err(ndev, "Failed to disable vPort: %d\n", err);
> >
> > + mana_fence_rqs(apc);
> > +
> > /* Even in err case, still need to cleanup the vPort */
> > mana_destroy_rxqs(apc);
> > mana_destroy_txq(apc);
> > diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h index
> > 204c2b612a62..2634e9135eed 100644
> > --- a/include/net/mana/mana.h
> > +++ b/include/net/mana/mana.h
> > @@ -574,6 +574,7 @@ struct mana_port_context { netdev_tx_t
> > mana_start_xmit(struct sk_buff *skb, struct net_device *ndev); int
> > mana_config_rss(struct mana_port_context *ac, enum TRI_STATE rx,
> > bool update_hash, bool update_tab);
> > +int mana_disable_vport_rx(struct mana_port_context *apc);
> >
> > int mana_alloc_queues(struct net_device *ndev); int
> > mana_attach(struct net_device *ndev);
> > --
> > 2.43.0
> >
^ permalink raw reply
* RE: [EXTERNAL] Re: [PATCH net-next] net: mana: Set default number of queues to 16
From: Long Li @ 2026-03-23 17:54 UTC (permalink / raw)
To: Simon Horman
Cc: Konstantin Taranov, Jakub Kicinski, David S . Miller, Paolo Abeni,
Eric Dumazet, Andrew Lunn, Jason Gunthorpe, Leon Romanovsky,
Haiyang Zhang, KY Srinivasan, Wei Liu, Dexuan Cui,
netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260323135848.GA81558@horms.kernel.org>
>
> On Fri, Mar 20, 2026 at 04:30:27PM -0700, Long Li wrote:
> > Set the default number of queues per vPort to MANA_DEF_NUM_QUEUES
> > (16), as 16 queues can achieve optimal throughput for typical
> > workloads. Users can increase the number of queues up to max_queues via
> ethtool if needed.
> >
> > Signed-off-by: Long Li <longli@microsoft.com>
> > ---
> > drivers/net/ethernet/microsoft/mana/mana_en.c | 2 +-
> > include/net/mana/mana.h | 1 +
> > 2 files changed, 2 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index 49c65cc1697c..7cae8a7b9f31 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > @@ -3357,7 +3357,7 @@ static int mana_probe_port(struct mana_context
> *ac, int port_idx,
> > apc->ac = ac;
> > apc->ndev = ndev;
> > apc->max_queues = gc->max_num_queues;
> > - apc->num_queues = gc->max_num_queues;
> > + apc->num_queues = min(gc->max_num_queues,
> MANA_DEF_NUM_QUEUES);
>
> Hi Long Li,
>
> Maybe I am misunderstanding things. But it seems to me that this patch sets a
> ceiling on the default number of queues. Which is subtly different to setting the
> default. Even if not in practice if max_num_queues is never less than
> MANA_DEF_NUM_QUEUES.
>
> If so I'm wondering if you could tweak the commit message accordingly.
Yes, will tweak the commit message and resend patch.
Thanks,
Long
>
> > apc->tx_queue_size = DEF_TX_BUFFERS_PER_QUEUE;
> > apc->rx_queue_size = DEF_RX_BUFFERS_PER_QUEUE;
> > apc->port_handle = INVALID_MANA_HANDLE;
>
> ...
^ permalink raw reply
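The distinction raised in the review above (a ceiling on the default versus a fixed default) can be sketched in a few lines of standalone C. This is only an illustration: default_num_queues() is a hypothetical helper that does not exist in the driver, and the assumption that the hardware always reports at least one queue is mine, not the patch's.

```c
#include <assert.h>

#define MANA_DEF_NUM_QUEUES 16

/*
 * Hypothetical helper (not in the driver): pick the initial queue count.
 * The result is MANA_DEF_NUM_QUEUES unless the hardware-reported limit
 * is lower, in which case the hardware limit wins.
 */
static unsigned int default_num_queues(unsigned int hw_max_queues)
{
	/* Assumption for this sketch: hardware reports at least one queue. */
	assert(hw_max_queues > 0);

	return hw_max_queues < MANA_DEF_NUM_QUEUES ?
	       hw_max_queues : MANA_DEF_NUM_QUEUES;
}
```

With a reported limit of 64 the helper yields 16, while a reported limit of 8 yields 8, which is the case the updated commit message is meant to cover.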
* [PATCH net v3] net: mana: fix use-after-free in add_adev() error path
From: Guangshuo Li @ 2026-03-23 16:57 UTC (permalink / raw)
To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Erni Sri Satya Vennela, Dipayaan Roy, Aditya Garg,
Shiraz Saleem, Kees Cook, Leon Romanovsky, linux-hyperv, netdev,
linux-kernel
Cc: Guangshuo Li, stable
If auxiliary_device_add() fails, add_adev() jumps to add_fail and calls
auxiliary_device_uninit(adev).
The auxiliary device has its release callback set to adev_release(),
which frees the containing struct mana_adev. Since adev is embedded in
struct mana_adev, the subsequent fall-through to init_fail and access
to adev->id may result in a use-after-free.
Fix this by saving the allocated auxiliary device id in a local
variable before calling auxiliary_device_add(), and use that saved id
in the cleanup path after auxiliary_device_uninit().
Fixes: a69839d4327d ("net: mana: Add support for auxiliary device")
Cc: stable@vger.kernel.org
Reviewed-by: Long Li <longli@microsoft.com>
Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
---
v2:
- explain the UAF in more detail
- retarget to net
- preserve reverse xmas tree order for local variables
v3:
- rebase onto the current net tree
drivers/net/ethernet/microsoft/mana/mana_en.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 9017e806ecda..d03f42245ab8 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -3424,6 +3424,7 @@ static int add_adev(struct gdma_dev *gd, const char *name)
{
struct auxiliary_device *adev;
struct mana_adev *madev;
+ int id;
int ret;
madev = kzalloc_obj(*madev);
@@ -3434,7 +3435,8 @@ static int add_adev(struct gdma_dev *gd, const char *name)
ret = mana_adev_idx_alloc();
if (ret < 0)
goto idx_fail;
- adev->id = ret;
+ id = ret;
+ adev->id = id;
adev->name = name;
adev->dev.parent = gd->gdma_context->dev;
@@ -3460,7 +3462,7 @@ static int add_adev(struct gdma_dev *gd, const char *name)
auxiliary_device_uninit(adev);
init_fail:
- mana_adev_idx_free(adev->id);
+ mana_adev_idx_free(id);
idx_fail:
kfree(madev);
--
2.43.0
^ permalink raw reply related
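The pattern the patch above relies on (copy the id into a local before an uninit call can free the containing object) can be shown in a small userspace sketch. Everything here is a stand-in written for illustration: struct outer, dev_uninit(), idx_free() and add_dev_failing() are hypothetical names, and the simulated "add" always fails so that only the error path is exercised.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct aux_dev { int id; };
struct outer { struct aux_dev adev; };	/* adev is embedded, as in mana_adev */

static int last_freed_id = -1;

/* Stand-in for the index allocator's free routine. */
static void idx_free(int id)
{
	last_freed_id = id;
}

/* Stand-in for uninit: the release step frees the containing object. */
static void dev_uninit(struct aux_dev *adev)
{
	free(container_of(adev, struct outer, adev));
}

/* Simulated error path: the device "add" step is assumed to have failed. */
static int add_dev_failing(int new_id)
{
	struct outer *o = calloc(1, sizeof(*o));
	struct aux_dev *adev;
	int id;

	assert(new_id >= 0);
	if (!o)
		return -1;

	adev = &o->adev;
	id = new_id;		/* keep a copy that outlives the object */
	adev->id = id;

	dev_uninit(adev);	/* o is freed here; adev now dangles */
	idx_free(id);		/* safe: reads the local, not adev->id */
	return -2;
}
```

Reading adev->id after dev_uninit() in this sketch would touch freed memory, which is the access the local copy avoids.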
* Re: [PATCH] mshv: Fix error handling in mshv_region_populate_pages
From: Stanislav Kinsburskii @ 2026-03-23 16:09 UTC (permalink / raw)
To: Wei Liu
Cc: Michael Kelley, kys@microsoft.com, haiyangz@microsoft.com,
decui@microsoft.com, longli@microsoft.com,
linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <20260318162003.GB262287@liuwe-devbox-debian-v2.local>
On Wed, Mar 18, 2026 at 04:20:03PM +0000, Wei Liu wrote:
> On Wed, Mar 18, 2026 at 02:38:49PM +0000, Michael Kelley wrote:
> > From: Wei Liu <wei.liu@kernel.org> Sent: Tuesday, March 17, 2026 11:20 PM
> > >
> > > On Tue, Mar 17, 2026 at 09:56:07PM +0000, Michael Kelley wrote:
> > > > From: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com> Sent: Tuesday, March 17, 2026 8:05 AM
> > > > >
> > > > > The current error handling has two issues:
> > > > >
> > > > > First, pin_user_pages_fast() can return a short pin count (less than
> > > > > requested but greater than zero) when it cannot pin all requested pages.
> > > > > This is treated as success, leading to partially pinned regions being
> > > > > used, which causes memory corruption.
> > > > >
> > > > > Second, when an error occurs mid-loop, already pinned pages from the
> > > > > current batch are not released before calling mshv_region_evict_pages(),
> > > > > causing a page reference leak.
> > > >
> > > > There's now an online LLM-based tool that is automatically reviewing
> > > > kernel patches. For this patch, the results are here:
> > > >
> > > >
> > > https://sashiko.dev/#/patchset/177375989324.25621.6532741522672582851.stgit
> > > %40skinsburskii-cloud-desktop.internal.cloudapp.net
> > > >
> > > > It has flagged the commit message as incorrectly referencing the
> > > > function mshv_region_evict_pages(), which doesn't exist.
> > > >
> > > > FWIW, the announcement about sashiko.dev is here:
> > > >
> > > > https://lore.kernel.org/lkml/7ia4o6kmpj5s.fsf@castle.c.googlers.com/
> > > >
> > > > Other than the commit message reference, this looks good to me.
> > > >
> > > > Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> > >
> > > The second point is written as if the code here should release the
> > > already pinned pages before calling mshv_region_invalidate_pages(), but
> > > the code actually relies on mshv_mem_region_invalidate_pages() to
> > > release the pages. The change here fixes the accounting.
> > >
> > > Second, when an error occurs mid-loop, already pinned pages from the
> > > current batch are not accounted for before calling
> > > mshv_region_invalidate_pages(), causing a page reference leak.
> > >
> > > And queued up the patch to hyperv-fixes.
> >
> > One other thing I noticed: The "Subject" of the patch is wrong. It
> > mentions mshv_region_populate_pages(), but the function being
> > modified is actually mshv_region_pin().
>
> Good catch. I have updated the subject line and pushed to hyperv-fixes.
>
Thank you Michael and Wei.
Thanks,
Stanislav
> Wei
>
> >
> > Michael
> >
> > >
> > > Wei
> > >
> > > >
> > > > >
> > > > > Fix by treating short pins as errors and explicitly unpinning the
> > > > > partial batch before cleanup.
> > > > >
> > > > > Signed-off-by: Stanislav Kinsburskii <skinsburskii@linux.microsoft.com>
> > > > > ---
> > > > > drivers/hv/mshv_regions.c | 6 ++++--
> > > > > 1 file changed, 4 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> > > > > index c28aac0726de..fdffd4f002f6 100644
> > > > > --- a/drivers/hv/mshv_regions.c
> > > > > +++ b/drivers/hv/mshv_regions.c
> > > > > @@ -314,15 +314,17 @@ int mshv_region_pin(struct mshv_mem_region *region)
> > > > > ret = pin_user_pages_fast(userspace_addr, nr_pages,
> > > > > FOLL_WRITE | FOLL_LONGTERM,
> > > > > pages);
> > > > > - if (ret < 0)
> > > > > + if (ret != nr_pages)
> > > > > goto release_pages;
> > > > > }
> > > > >
> > > > > return 0;
> > > > >
> > > > > release_pages:
> > > > > + if (ret > 0)
> > > > > + done_count += ret;
> > > > > mshv_region_invalidate_pages(region, 0, done_count);
> > > > > - return ret;
> > > > > + return ret < 0 ? ret : -ENOMEM;
> > > > > }
> > > > >
> > > > > static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
> > > > >
> > > > >
> > > >
> >
^ permalink raw reply
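The short-pin handling discussed in this thread can be sketched outside the kernel as follows. This is a simplified model, not the mshv code: pin_all() and the pin_full/pin_short/pin_fail stubs are hypothetical, the stubs merely imitate the return convention described for pin_user_pages_fast() (negative errno, or a count that may be smaller than requested), and to_release stands in for the count handed to the cleanup routine.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical pin callback: returns pages pinned, or a negative errno. */
typedef long (*pin_fn)(unsigned long nr_pages);

static long pin_full(unsigned long nr)  { return (long)nr; }
static long pin_short(unsigned long nr) { return nr > 1 ? (long)nr - 1 : 0; }
static long pin_fail(unsigned long nr)  { (void)nr; return -EFAULT; }

/*
 * Pin 'total' pages in batches. A short pin is treated as an error, and
 * the pages pinned by the failing batch are added to the count that the
 * cleanup step must release (*to_release).
 */
static int pin_all(pin_fn pin, unsigned long total, unsigned long batch,
		   unsigned long *to_release)
{
	unsigned long done = 0;
	long ret = 0;

	assert(batch > 0);
	while (done < total) {
		unsigned long nr = total - done < batch ? total - done : batch;

		ret = pin(nr);
		if (ret != (long)nr)
			goto release;
		done += nr;
	}
	*to_release = 0;
	return 0;

release:
	if (ret > 0)
		done += (unsigned long)ret;	/* account for the partial batch */
	*to_release = done;
	return ret < 0 ? (int)ret : -ENOMEM;
}
```

In this model a batch of 4 that pins only 3 pages reports -ENOMEM with 3 pages to release, whereas a hard failure passes the original errno through unchanged.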
* Re: [PATCH] hv_sock: update outdated comment for renamed vsock_stream_recvmsg()
From: Simon Horman @ 2026-03-23 15:26 UTC (permalink / raw)
To: Kexin Sun
Cc: kys, haiyangz, wei.liu, decui, longli, sgarzare, davem, edumazet,
kuba, pabeni, linux-hyperv, virtualization, netdev, linux-kernel,
julia.lawall, xutong.ma, yunbolyu, ratnadiraw
In-Reply-To: <20260321105753.6751-1-kexinsun@smail.nju.edu.cn>
On Sat, Mar 21, 2026 at 06:57:53PM +0800, Kexin Sun wrote:
> The function vsock_stream_recvmsg() was renamed to
> vsock_connectible_recvmsg() by commit a9e29e5511b9 ("af_vsock:
> update functions for connectible socket"). Update the comment
> accordingly.
>
> Assisted-by: unnamed:deepseek-v3.2 coccinelle
> Signed-off-by: Kexin Sun <kexinsun@smail.nju.edu.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
^ permalink raw reply
* Re: [PATCH net v2] net: mana: fix use-after-free in add_adev() error path
From: Simon Horman @ 2026-03-23 14:26 UTC (permalink / raw)
To: Guangshuo Li
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Saurabh Sengar, Erni Sri Satya Vennela,
Shradha Gupta, Aditya Garg, Dipayaan Roy, Shiraz Saleem,
Leon Romanovsky, linux-hyperv, netdev, linux-kernel, stable
In-Reply-To: <20260321053918.791068-1-lgs201920130244@gmail.com>
On Sat, Mar 21, 2026 at 01:39:18PM +0800, Guangshuo Li wrote:
> If auxiliary_device_add() fails, add_adev() jumps to add_fail and calls
> auxiliary_device_uninit(adev).
>
> The auxiliary device has its release callback set to adev_release(),
> which frees the containing struct mana_adev. Since adev is embedded in
> struct mana_adev, the subsequent fall-through to init_fail and access
> to adev->id may result in a use-after-free.
>
> Fix this by saving the allocated auxiliary device id in a local
> variable before calling auxiliary_device_add(), and use that saved id
> in the cleanup path after auxiliary_device_uninit().
>
> Fixes: a69839d4327d ("net: mana: Add support for auxiliary device")
> Cc: stable@vger.kernel.org
> Reviewed-by: Long Li <longli@microsoft.com>
> Signed-off-by: Guangshuo Li <lgs201920130244@gmail.com>
> ---
> v2:
> - explain the UAF in more detail
> - retarget to net
> - preserve reverse xmas tree order for local variables
Thanks for the update.
Unfortunately the patch doesn't apply cleanly against net,
which breaks our CI.
Please rebase and repost.
--
pw-bot: changes-requested
^ permalink raw reply