* [PATCH v2 0/3] Add NUMA-node-aware synchronous probing to driver core
@ 2026-01-22 14:52 Jinhui Guo
From: Jinhui Guo @ 2026-01-22 14:52 UTC (permalink / raw)
To: dakr, alexanderduyck, bhelgaas, bvanassche, dan.j.williams,
gregkh, helgaas, rafael, tj, frederic
Cc: guojinhui.liam, linux-kernel, linux-pci
Hi all,
** Overview **
This patchset introduces NUMA-node-aware synchronous probing.
Drivers can initialize and allocate memory on the device’s local
node without scattering kmalloc_node() calls throughout the code.
NUMA-aware probing was added to PCI drivers in 2005 and has
benefited them ever since.
The asynchronous probe path already supports NUMA-node-aware
probing via async_schedule_dev() in the driver core. Since NUMA
affinity is orthogonal to sync/async probing, this patchset adds
NUMA-node-aware support to the synchronous probe path.
** Background **
The idea arose from a discussion with Bjorn and Danilo about a
PCI-probe issue [1]:
when PCI devices on the same NUMA node are probed asynchronously,
pci_call_probe() calls work_on_cpu(), pins every probe worker to
the same CPU inside that node, and forces the probes to run serially.
Testing three NVMe devices on the same NUMA node of an AMD EPYC 9A64
2.4 GHz processor (all on CPU 0):
nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, probe cost: 53372612 ns
nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:2, probe cost: 49532941 ns
nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:3, probe cost: 47315175 ns
Since the driver core already provides NUMA-node-aware asynchronous
probing, we can extend the same capability to the synchronous probe
path. This solves the issue and lets other drivers benefit from
NUMA-local initialization as well.
[1] https://lore.kernel.org/all/20251227113326.964-1-guojinhui.liam@bytedance.com/
** Changes **
The series makes three main changes:
1. Adds helper __device_attach_driver_scan() to eliminate duplication
between __device_attach() and __device_attach_async_helper().
2. Introduces helper __driver_probe_device_node() and uses it to enable
NUMA-local synchronous probing in __device_attach(), device_driver_attach(),
and __driver_attach().
3. Removes the now-redundant NUMA code from the PCI driver.
** Test **
I added debug prints to nvme, mlx5, usbhid, and intel_rapl_msr and
ran tests on an AMD EPYC 9A64 system (Updated test results for the
new patchset are provided below):
NUMA topology of the test machine:
# lscpu |grep NUMA
NUMA node(s): 2
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255
1. Without the patchset
- PCI drivers (nvme, mlx5) probe sequentially on CPU 0
- USB and platform drivers pick random CPUs in the udev worker
nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, cost: 54013202 ns
nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:2, cost: 53968911 ns
nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:4, cost: 48077276 ns
mlx5_core 0000:41:00.0: CPU: 0, COMM: kworker/0:2 cost: 506256717 ns
mlx5_core 0000:41:00.1: CPU: 0, COMM: kworker/0:2 cost: 514289394 ns
usb 1-2.4: CPU: 163, COMM: (udev-worker), cost 854131 ns
usb 1-2.6: CPU: 163, COMM: (udev-worker), cost 967993 ns
intel_rapl_msr intel_rapl_msr.0: CPU: 61, COMM: (udev-worker), cost: 3717567 ns
2. With the patchset
- PCI probes are spread across CPUs inside the device’s NUMA node
- Asynchronous nvme probes are ~35% faster; synchronous mlx5 times
are unchanged
- USB probe times are virtually identical
- Platform driver (no NUMA node) falls back to the original path
nvme 0000:01:00.0: CPU: 3, COMM: kworker/u1025:1, cost: 34244206 ns
nvme 0000:02:00.0: CPU: 1, COMM: kworker/u1025:2, cost: 33883391 ns
nvme 0000:03:00.0: CPU: 2, COMM: kworker/u1025:3, cost: 33943040 ns
mlx5_core 0000:41:00.0: CPU: 3, COMM: kworker/u1025:1, cost: 507206174 ns
mlx5_core 0000:41:00.1: CPU: 3, COMM: kworker/u1025:1, cost: 514927642 ns
usb 1-2.4: CPU: 4, COMM: kworker/u1025:8, cost: 991417 ns
usb 1-2.6: CPU: 2, COMM: kworker/u1025:5, cost: 935112 ns
intel_rapl_msr intel_rapl_msr.0: CPU: 17, COMM: (udev-worker), cost: 4849967 ns
3. With the patchset, unbind/bind cycles also spread PCI probes across
CPUs within the device’s NUMA node:
nvme 0000:02:00.0: CPU: 130, COMM: kworker/u1025:4, cost: 35086209 ns
** Final **
Comments and suggestions are welcome.
Best Regards,
Jinhui
---
v1: https://lore.kernel.org/all/20260107175548.1792-1-guojinhui.liam@bytedance.com/
Changelog v1 -> v2:
- Reword the first patch’s commit message for accuracy and add
Reviewed-by tags; no code changes.
- Refactor the second patch to reduce complexity: introduce
__driver_probe_device_node() and update the signature of
driver_probe_device() to support NUMA-node-aware synchronous
probing. (suggested by Danilo)
- The third patch resolves conflicts with three patches from
patchset [2] that have since been merged into linux-next.git.
- Update the test data in the cover letter for the new patchset.
[2] https://lore.kernel.org/all/20260101221359.22298-1-frederic@kernel.org/
Jinhui Guo (3):
driver core: Introduce helper function __device_attach_driver_scan()
driver core: Add NUMA-node awareness to the synchronous probe path
PCI: Clean up NUMA-node awareness in pci_bus_type probe
drivers/base/dd.c | 147 +++++++++++++++++++++++++++++----------
drivers/pci/pci-driver.c | 116 +++---------------------------
include/linux/pci.h | 4 --
kernel/sched/isolation.c | 2 -
4 files changed, 118 insertions(+), 151 deletions(-)
--
2.20.1
* [PATCH v2 1/3] driver core: Introduce helper function __device_attach_driver_scan()
From: Jinhui Guo @ 2026-01-22 14:52 UTC (permalink / raw)
To: dakr, alexanderduyck, bhelgaas, bvanassche, dan.j.williams,
gregkh, helgaas, rafael, tj, frederic
Cc: guojinhui.liam, linux-kernel, linux-pci
The logic responsible for managing parent device runtime PM
(get/put), iterating over the bus drivers, and determining
whether asynchronous probing is required is currently
duplicated between __device_attach() and
__device_attach_async_helper().
Factor out this common logic into a new static helper,
__device_attach_driver_scan(). This reduces code duplication
and improves maintainability without altering existing
behavior.
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
Reviewed-by: Danilo Krummrich <dakr@kernel.org>
---
drivers/base/dd.c | 71 ++++++++++++++++++++++++++---------------------
1 file changed, 40 insertions(+), 31 deletions(-)
diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index ed3a07624816..b6be95871d3d 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -964,6 +964,44 @@ static int __device_attach_driver(struct device_driver *drv, void *_data)
return ret == 0;
}
+static int __device_attach_driver_scan(struct device_attach_data *data,
+ bool *need_async)
+{
+ int ret = 0;
+ struct device *dev = data->dev;
+
+ if (dev->parent)
+ pm_runtime_get_sync(dev->parent);
+
+ ret = bus_for_each_drv(dev->bus, NULL, data,
+ __device_attach_driver);
+ /*
+ * When running in an async worker, a NULL need_async is passed
+ * since we are already in an async worker.
+ */
+ if (need_async && !ret && data->check_async && data->have_async) {
+ /*
+ * If we could not find appropriate driver
+ * synchronously and we are allowed to do
+ * async probes and there are drivers that
+ * want to probe asynchronously, we'll
+ * try them.
+ */
+ dev_dbg(dev, "scheduling asynchronous probe\n");
+ get_device(dev);
+ *need_async = true;
+ } else {
+ if (!need_async)
+ dev_dbg(dev, "async probe completed\n");
+ pm_request_idle(dev);
+ }
+
+ if (dev->parent)
+ pm_runtime_put(dev->parent);
+
+ return ret;
+}
+
static void __device_attach_async_helper(void *_dev, async_cookie_t cookie)
{
struct device *dev = _dev;
@@ -984,16 +1022,8 @@ static void __device_attach_async_helper(void *_dev, async_cookie_t cookie)
if (dev->p->dead || dev->driver)
goto out_unlock;
- if (dev->parent)
- pm_runtime_get_sync(dev->parent);
+ __device_attach_driver_scan(&data, NULL);
- bus_for_each_drv(dev->bus, NULL, &data, __device_attach_driver);
- dev_dbg(dev, "async probe completed\n");
-
- pm_request_idle(dev);
-
- if (dev->parent)
- pm_runtime_put(dev->parent);
out_unlock:
device_unlock(dev);
@@ -1027,28 +1057,7 @@ static int __device_attach(struct device *dev, bool allow_async)
.want_async = false,
};
- if (dev->parent)
- pm_runtime_get_sync(dev->parent);
-
- ret = bus_for_each_drv(dev->bus, NULL, &data,
- __device_attach_driver);
- if (!ret && allow_async && data.have_async) {
- /*
- * If we could not find appropriate driver
- * synchronously and we are allowed to do
- * async probes and there are drivers that
- * want to probe asynchronously, we'll
- * try them.
- */
- dev_dbg(dev, "scheduling asynchronous probe\n");
- get_device(dev);
- async = true;
- } else {
- pm_request_idle(dev);
- }
-
- if (dev->parent)
- pm_runtime_put(dev->parent);
+ ret = __device_attach_driver_scan(&data, &async);
}
out_unlock:
device_unlock(dev);
--
2.20.1
* [PATCH v2 2/3] driver core: Add NUMA-node awareness to the synchronous probe path
From: Jinhui Guo @ 2026-01-22 14:52 UTC (permalink / raw)
To: dakr, alexanderduyck, bhelgaas, bvanassche, dan.j.williams,
gregkh, helgaas, rafael, tj, frederic
Cc: guojinhui.liam, linux-kernel, linux-pci
Introduce NUMA-node-aware synchronous probing: drivers
can initialize and allocate memory on the device’s local
node without scattering kmalloc_node() calls throughout
the code.
NUMA-aware probing was first added to PCI drivers by
commit d42c69972b85 ("[PATCH] PCI: Run PCI driver
initialization on local node") in 2005 and has benefited
PCI drivers ever since.
The asynchronous probe path already supports NUMA-node-aware
probing via async_schedule_dev() in the driver core. Since
NUMA affinity is orthogonal to sync/async probing, this
patch adds NUMA-node-aware support to the synchronous
probe path.
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
---
drivers/base/dd.c | 76 +++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 70 insertions(+), 6 deletions(-)
diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index b6be95871d3d..a8d560034abe 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -810,10 +810,56 @@ static int __driver_probe_device(const struct device_driver *drv, struct device
return ret;
}
+/* Context for NUMA execution */
+struct numa_work_ctx {
+ struct work_struct work;
+ const struct device_driver *drv;
+ struct device *dev;
+ int result;
+};
+
+/* Worker function running on the target node */
+static void __driver_probe_device_node_helper(struct work_struct *work)
+{
+ struct numa_work_ctx *ctx = container_of(work, struct numa_work_ctx, work);
+
+ ctx->result = __driver_probe_device(ctx->drv, ctx->dev);
+}
+
+/*
+ * __driver_probe_device_node - execute __driver_probe_device on a specific NUMA node synchronously
+ * @drv: driver to bind a device to
+ * @dev: device to try to bind to the driver
+ *
+ * Returns the result of the function execution, or -ENODEV if initialization fails.
+ * If the node is invalid or offline, it falls back to local execution.
+ */
+static int __driver_probe_device_node(const struct device_driver *drv, struct device *dev)
+{
+ struct numa_work_ctx ctx;
+ int node = dev_to_node(dev);
+
+ if (node < 0 || node >= MAX_NUMNODES || !node_online(node))
+ return __driver_probe_device(drv, dev);
+
+ ctx.drv = drv;
+ ctx.dev = dev;
+ ctx.result = -ENODEV;
+ INIT_WORK_ONSTACK(&ctx.work, __driver_probe_device_node_helper);
+
+ /* Use system_dfl_wq to allow execution on the specific node. */
+ queue_work_node(node, system_dfl_wq, &ctx.work);
+ flush_work(&ctx.work);
+ destroy_work_on_stack(&ctx.work);
+
+ return ctx.result;
+}
+
/**
* driver_probe_device - attempt to bind device & driver together
* @drv: driver to bind a device to
* @dev: device to try to bind to the driver
+ * @in_async: true if the caller is running in an asynchronous worker context
*
* This function returns -ENODEV if the device is not registered, -EBUSY if it
* already has a driver, 0 if the device is bound successfully and a positive
@@ -824,13 +870,22 @@ static int __driver_probe_device(const struct device_driver *drv, struct device
*
* If the device has a parent, runtime-resume the parent before driver probing.
*/
-static int driver_probe_device(const struct device_driver *drv, struct device *dev)
+static int driver_probe_device(const struct device_driver *drv, struct device *dev, bool in_async)
{
int trigger_count = atomic_read(&deferred_trigger_count);
int ret;
atomic_inc(&probe_count);
- ret = __driver_probe_device(drv, dev);
+ /*
+ * If we are already in an asynchronous worker, invoke __driver_probe_device()
+ * directly to avoid the overhead of an additional workqueue scheduling.
+ * The async subsystem manages its own concurrency and placement.
+ */
+ if (in_async)
+ ret = __driver_probe_device(drv, dev);
+ else
+ ret = __driver_probe_device_node(drv, dev);
+
if (ret == -EPROBE_DEFER || ret == EPROBE_DEFER) {
driver_deferred_probe_add(dev);
@@ -919,6 +974,13 @@ struct device_attach_data {
* driver, we'll encounter one that requests asynchronous probing.
*/
bool have_async;
+
+ /*
+ * True when running inside an asynchronous worker context scheduled
+ * by async_schedule_dev() with callback function
+ * __device_attach_async_helper().
+ */
+ bool in_async;
};
static int __device_attach_driver(struct device_driver *drv, void *_data)
@@ -958,7 +1020,7 @@ static int __device_attach_driver(struct device_driver *drv, void *_data)
* Ignore errors returned by ->probe so that the next driver can try
* its luck.
*/
- ret = driver_probe_device(drv, dev);
+ ret = driver_probe_device(drv, dev, data->in_async);
if (ret < 0)
return ret;
return ret == 0;
@@ -1009,6 +1071,7 @@ static void __device_attach_async_helper(void *_dev, async_cookie_t cookie)
.dev = dev,
.check_async = true,
.want_async = true,
+ .in_async = true,
};
device_lock(dev);
@@ -1055,6 +1118,7 @@ static int __device_attach(struct device *dev, bool allow_async)
.dev = dev,
.check_async = allow_async,
.want_async = false,
+ .in_async = false,
};
ret = __device_attach_driver_scan(&data, &async);
@@ -1144,7 +1208,7 @@ int device_driver_attach(const struct device_driver *drv, struct device *dev)
int ret;
__device_driver_lock(dev, dev->parent);
- ret = __driver_probe_device(drv, dev);
+ ret = __driver_probe_device_node(drv, dev);
__device_driver_unlock(dev, dev->parent);
/* also return probe errors as normal negative errnos */
@@ -1165,7 +1229,7 @@ static void __driver_attach_async_helper(void *_dev, async_cookie_t cookie)
__device_driver_lock(dev, dev->parent);
drv = dev->p->async_driver;
dev->p->async_driver = NULL;
- ret = driver_probe_device(drv, dev);
+ ret = driver_probe_device(drv, dev, true);
__device_driver_unlock(dev, dev->parent);
dev_dbg(dev, "driver %s async attach completed: %d\n", drv->name, ret);
@@ -1233,7 +1297,7 @@ static int __driver_attach(struct device *dev, void *data)
}
__device_driver_lock(dev, dev->parent);
- driver_probe_device(drv, dev);
+ driver_probe_device(drv, dev, false);
__device_driver_unlock(dev, dev->parent);
return 0;
--
2.20.1
* [PATCH v2 3/3] PCI: Clean up NUMA-node awareness in pci_bus_type probe
From: Jinhui Guo @ 2026-01-22 14:52 UTC (permalink / raw)
To: dakr, alexanderduyck, bhelgaas, bvanassche, dan.j.williams,
gregkh, helgaas, rafael, tj, frederic
Cc: guojinhui.liam, linux-kernel, linux-pci
With NUMA-node-aware probing now handled by the driver core,
the equivalent code in the PCI driver is redundant and can
be removed.
Dropping it speeds up asynchronous probing by ~35%; the gain
comes from eliminating the work_on_cpu() call in pci_call_probe()
that previously pinned every probe worker to the same CPU, forcing
devices on the same NUMA node to probe serially.
Testing three NVMe devices on the same NUMA node of an AMD
EPYC 9A64 2.4 GHz processor shows a 35% probe-time improvement
with the patch:
Before (all on CPU 0):
nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, cost: 52266334ns
nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:0, cost: 50787194ns
nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:2, cost: 50541584ns
After (spread across CPUs 1, 2, 4):
nvme 0000:01:00.0: CPU: 1, COMM: kworker/u1025:2, cost: 35399608ns
nvme 0000:02:00.0: CPU: 2, COMM: kworker/u1025:3, cost: 35156157ns
nvme 0000:03:00.0: CPU: 4, COMM: kworker/u1025:0, cost: 35322116ns
The improvement grows with more PCI devices because fewer probes
contend for the same CPU.
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
---
drivers/pci/pci-driver.c | 116 +++------------------------------------
include/linux/pci.h | 4 --
kernel/sched/isolation.c | 2 -
3 files changed, 8 insertions(+), 114 deletions(-)
diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 6b80400ee9b9..258f16da6550 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -296,17 +296,9 @@ static struct attribute *pci_drv_attrs[] = {
};
ATTRIBUTE_GROUPS(pci_drv);
-struct drv_dev_and_id {
- struct pci_driver *drv;
- struct pci_dev *dev;
- const struct pci_device_id *id;
-};
-
-static int local_pci_probe(struct drv_dev_and_id *ddi)
+static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
+ const struct pci_device_id *id)
{
- struct pci_dev *pci_dev = ddi->dev;
- struct pci_driver *pci_drv = ddi->drv;
- struct device *dev = &pci_dev->dev;
int rc;
/*
@@ -318,113 +310,25 @@ static int local_pci_probe(struct drv_dev_and_id *ddi)
* count, in its probe routine and pm_runtime_get_noresume() in
* its remove routine.
*/
- pm_runtime_get_sync(dev);
- pci_dev->driver = pci_drv;
- rc = pci_drv->probe(pci_dev, ddi->id);
+ pm_runtime_get_sync(&dev->dev);
+ dev->driver = drv;
+ rc = drv->probe(dev, id);
if (!rc)
return rc;
if (rc < 0) {
- pci_dev->driver = NULL;
- pm_runtime_put_sync(dev);
+ dev->driver = NULL;
+ pm_runtime_put_sync(&dev->dev);
return rc;
}
/*
* Probe function should return < 0 for failure, 0 for success
* Treat values > 0 as success, but warn.
*/
- pci_warn(pci_dev, "Driver probe function unexpectedly returned %d\n",
+ pci_warn(dev, "Driver probe function unexpectedly returned %d\n",
rc);
return 0;
}
-static struct workqueue_struct *pci_probe_wq;
-
-struct pci_probe_arg {
- struct drv_dev_and_id *ddi;
- struct work_struct work;
- int ret;
-};
-
-static void local_pci_probe_callback(struct work_struct *work)
-{
- struct pci_probe_arg *arg = container_of(work, struct pci_probe_arg, work);
-
- arg->ret = local_pci_probe(arg->ddi);
-}
-
-static bool pci_physfn_is_probed(struct pci_dev *dev)
-{
-#ifdef CONFIG_PCI_IOV
- return dev->is_virtfn && dev->physfn->is_probed;
-#else
- return false;
-#endif
-}
-
-static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
- const struct pci_device_id *id)
-{
- int error, node, cpu;
- struct drv_dev_and_id ddi = { drv, dev, id };
-
- /*
- * Execute driver initialization on node where the device is
- * attached. This way the driver likely allocates its local memory
- * on the right node.
- */
- node = dev_to_node(&dev->dev);
- dev->is_probed = 1;
-
- cpu_hotplug_disable();
- /*
- * Prevent nesting work_on_cpu() for the case where a Virtual Function
- * device is probed from work_on_cpu() of the Physical device.
- */
- if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
- pci_physfn_is_probed(dev)) {
- error = local_pci_probe(&ddi);
- } else {
- struct pci_probe_arg arg = { .ddi = &ddi };
-
- INIT_WORK_ONSTACK(&arg.work, local_pci_probe_callback);
- /*
- * The target election and the enqueue of the work must be within
- * the same RCU read side section so that when the workqueue pool
- * is flushed after a housekeeping cpumask update, further readers
- * are guaranteed to queue the probing work to the appropriate
- * targets.
- */
- rcu_read_lock();
- cpu = cpumask_any_and(cpumask_of_node(node),
- housekeeping_cpumask(HK_TYPE_DOMAIN));
-
- if (cpu < nr_cpu_ids) {
- struct workqueue_struct *wq = pci_probe_wq;
-
- if (WARN_ON_ONCE(!wq))
- wq = system_percpu_wq;
- queue_work_on(cpu, wq, &arg.work);
- rcu_read_unlock();
- flush_work(&arg.work);
- error = arg.ret;
- } else {
- rcu_read_unlock();
- error = local_pci_probe(&ddi);
- }
-
- destroy_work_on_stack(&arg.work);
- }
-
- dev->is_probed = 0;
- cpu_hotplug_enable();
- return error;
-}
-
-void pci_probe_flush_workqueue(void)
-{
- flush_workqueue(pci_probe_wq);
-}
-
/**
* __pci_device_probe - check if a driver wants to claim a specific PCI device
* @drv: driver to call to check if it wants the PCI device
@@ -1734,10 +1638,6 @@ static int __init pci_driver_init(void)
{
int ret;
- pci_probe_wq = alloc_workqueue("sync_wq", WQ_PERCPU, 0);
- if (!pci_probe_wq)
- return -ENOMEM;
-
ret = bus_register(&pci_bus_type);
if (ret)
return ret;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 7e36936bb37a..ae05faa105e2 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -486,7 +486,6 @@ struct pci_dev {
unsigned int io_window_1k:1; /* Intel bridge 1K I/O windows */
unsigned int irq_managed:1;
unsigned int non_compliant_bars:1; /* Broken BARs; ignore them */
- unsigned int is_probed:1; /* Device probing in progress */
unsigned int link_active_reporting:1;/* Device capable of reporting link active */
unsigned int no_vf_scan:1; /* Don't scan for VFs after IOV enablement */
unsigned int no_command_memory:1; /* No PCI_COMMAND_MEMORY */
@@ -1211,7 +1210,6 @@ struct pci_bus *pci_create_root_bus(struct device *parent, int bus,
struct pci_ops *ops, void *sysdata,
struct list_head *resources);
int pci_host_probe(struct pci_host_bridge *bridge);
-void pci_probe_flush_workqueue(void);
int pci_bus_insert_busn_res(struct pci_bus *b, int bus, int busmax);
int pci_bus_update_busn_res_end(struct pci_bus *b, int busmax);
void pci_bus_release_busn_res(struct pci_bus *b);
@@ -2085,8 +2083,6 @@ static inline int pci_has_flag(int flag) { return 0; }
_PCI_NOP_ALL(read, *)
_PCI_NOP_ALL(write,)
-static inline void pci_probe_flush_workqueue(void) { }
-
static inline struct pci_dev *pci_get_device(unsigned int vendor,
unsigned int device,
struct pci_dev *from)
diff --git a/kernel/sched/isolation.c b/kernel/sched/isolation.c
index ef152d401fe2..3d28d8163ee4 100644
--- a/kernel/sched/isolation.c
+++ b/kernel/sched/isolation.c
@@ -8,7 +8,6 @@
*
*/
#include <linux/sched/isolation.h>
-#include <linux/pci.h>
#include "sched.h"
enum hk_flags {
@@ -144,7 +143,6 @@ int housekeeping_update(struct cpumask *isol_mask)
synchronize_rcu();
- pci_probe_flush_workqueue();
mem_cgroup_flush_workqueue();
vmstat_flush_workqueue();
--
2.20.1
* Re: [PATCH v2 0/3] Add NUMA-node-aware synchronous probing to driver core
From: dan.j.williams @ 2026-01-24 1:04 UTC (permalink / raw)
To: Jinhui Guo, dakr, alexanderduyck, bhelgaas, bvanassche,
dan.j.williams, gregkh, helgaas, rafael, tj, frederic
Cc: guojinhui.liam, linux-kernel, linux-pci
Jinhui Guo wrote:
> Hi all,
>
> ** Overview **
>
> This patchset introduces NUMA-node-aware synchronous probing.
>
> Drivers can initialize and allocate memory on the device’s local
> node without scattering kmalloc_node() calls throughout the code.
> NUMA-aware probing was added to PCI drivers in 2005 and has
> benefited them ever since.
>
> The asynchronous probe path already supports NUMA-node-aware
> probing via async_schedule_dev() in the driver core. Since NUMA
> affinity is orthogonal to sync/async probing, this patchset adds
> NUMA-node-aware support to the synchronous probe path.
>
> ** Background **
>
> The idea arose from a discussion with Bjorn and Danilo about a
> PCI-probe issue [1]:
>
> when PCI devices on the same NUMA node are probed asynchronously,
> pci_call_probe() calls work_on_cpu(), pins every probe worker to
> the same CPU inside that node, and forces the probes to run serially.
>
> Testing three NVMe devices on the same NUMA node of an AMD EPYC 9A64
> 2.4 GHz processor (all on CPU 0):
>
> nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, probe cost: 53372612 ns
> nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:2, probe cost: 49532941 ns
> nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:3, probe cost: 47315175 ns
>
> Since the driver core already provides NUMA-node-aware asynchronous
> probing, we can extend the same capability to the synchronous probe
> path. This solves the issue and lets other drivers benefit from
> NUMA-local initialization as well.
I like that from a global benefit perspective, but not necessarily from
a regression perspective. Is there a minimal fix to PCI to make its
current workqueue unbound, then if that goes well come back and move all
devices into this scheme?
* Re: [PATCH v2 0/3] Add NUMA-node-aware synchronous probing to driver core
From: Jinhui Guo @ 2026-01-26 9:17 UTC (permalink / raw)
To: dan.j.williams
Cc: alexanderduyck, bhelgaas, bvanassche, dakr, frederic, gregkh,
guojinhui.liam, helgaas, linux-kernel, linux-pci, rafael, tj
On Fri Jan 23, 2026 17:04:27 -0800, Dan Williams wrote:
> Jinhui Guo wrote:
> > Hi all,
> >
> > ** Overview **
> >
> > This patchset introduces NUMA-node-aware synchronous probing.
> >
> > Drivers can initialize and allocate memory on the device’s local
> > node without scattering kmalloc_node() calls throughout the code.
> > NUMA-aware probing was added to PCI drivers in 2005 and has
> > benefited them ever since.
> >
> > The asynchronous probe path already supports NUMA-node-aware
> > probing via async_schedule_dev() in the driver core. Since NUMA
> > affinity is orthogonal to sync/async probing, this patchset adds
> > NUMA-node-aware support to the synchronous probe path.
> >
> > ** Background **
> >
> > The idea arose from a discussion with Bjorn and Danilo about a
> > PCI-probe issue [1]:
> >
> > when PCI devices on the same NUMA node are probed asynchronously,
> > pci_call_probe() calls work_on_cpu(), pins every probe worker to
> > the same CPU inside that node, and forces the probes to run serially.
> >
> > Testing three NVMe devices on the same NUMA node of an AMD EPYC 9A64
> > 2.4 GHz processor (all on CPU 0):
> >
> > nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, probe cost: 53372612 ns
> > nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:2, probe cost: 49532941 ns
> > nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:3, probe cost: 47315175 ns
> >
> > Since the driver core already provides NUMA-node-aware asynchronous
> > probing, we can extend the same capability to the synchronous probe
> > path. This solves the issue and lets other drivers benefit from
> > NUMA-local initialization as well.
>
> I like that from a global benefit perspective, but not necessarily from
> a regression perspective. Is there a minimal fix to PCI to make its
> current workqueue unbound, then if that goes well come back and move all
> devices into this scheme?
Hi Dan,
Thank you for your time, and apologies for the delayed reply.
I understand your concerns about stability and hope for better PCI regression
handling. However, I believe introducing NUMA-node awareness to the driver
core's asynchronous probe path is the better solution:
1. The asynchronous path already uses async_schedule_dev() with queue_work_node()
to bind workers to specific NUMA nodes—this causes no side effects to driver
probing.
2. I initially submitted a PCI-only fix [1], but handling asynchronous probing in
PCI driver proved difficult. Using current_is_async() works but feels fragile.
After discussions with Bjorn and Danilo [2][3], moving the solution to driver
core makes distinguishing async/sync probing straightforward. Testing shows
minimal impact on synchronous probe time.
3. If you prefer a PCI-only approach, we could add a flag in struct device_driver
(default false) that PCI sets during registration. This limits the new path to
PCI devices while others retain existing behavior. The extra code is ~10 lines
and can be removed once confidence is established.
4. I'm committed to supporting this: I'll include "Fixes:" tags for any fallout
and provide patches within a month of any report. Since the logic mirrors the
core async helper, risk should be low—but I'll take full responsibility
regardless.
Please let me know if you have other concerns.
[1] https://lore.kernel.org/all/20251230142736.1168-1-guojinhui.liam@bytedance.com/
[2] https://lore.kernel.org/all/20251231165503.GA159243@bhelgaas/
[3] https://lore.kernel.org/all/DFFXIZR1AGTV.2WZ1G2JAU0HFQ@kernel.org/
Best Regards,
Jinhui
* Re: [PATCH v2 0/3] Add NUMA-node-aware synchronous probing to driver core
From: Jinhui Guo @ 2026-01-27 2:32 UTC (permalink / raw)
To: guojinhui.liam
Cc: alexanderduyck, bhelgaas, bvanassche, dakr, dan.j.williams,
frederic, gregkh, helgaas, linux-kernel, linux-pci, rafael, tj
On Mon Jan 26, 2026 17:17:49 -0800, Jinhui Guo wrote:
> Hi Dan,
>
> Thank you for your time, and apologies for the delayed reply.
>
> I understand your concerns about stability and hope for better PCI regression
> handling. However, I believe introducing NUMA-node awareness to the driver
> core's asynchronous probe path is the better solution:
"asynchronous probe path" -> "synchronous probe path"
Apologies for the typo.
Best Regards,
Jinhui
* Re: [PATCH v2 0/3] Add NUMA-node-aware synchronous probing to driver core
From: dan.j.williams @ 2026-01-27 3:37 UTC (permalink / raw)
To: Jinhui Guo, dan.j.williams
Cc: alexanderduyck, bhelgaas, bvanassche, dakr, frederic, gregkh,
guojinhui.liam, helgaas, linux-kernel, linux-pci, rafael, tj
Jinhui Guo wrote:
[..]
> > I like that from a global benefit perspective, but not necessarily from
> > a regression perspective. Is there a minimal fix to PCI to make its
> > current workqueue unbound, then if that goes well come back and move all
> > devices into this scheme?
>
> Hi Dan,
>
> Thank you for your time, and apologies for the delayed reply.
I would not have read an earlier reply over this weekend anyway, so no
worries.
> I understand your concerns about stability and hope for better PCI regression
> handling. However, I believe introducing NUMA-node awareness to the driver
> core's asynchronous probe path is the better solution:
>
> 1. The asynchronous path already uses async_schedule_dev() with queue_work_node()
> to bind workers to specific NUMA nodes—this causes no side effects to driver
> probing.
> 2. I initially submitted a PCI-only fix [1], but handling asynchronous probing in
> PCI driver proved difficult. Using current_is_async() works but feels fragile.
> After discussions with Bjorn and Danilo [2][3], moving the solution to driver
> core makes distinguishing async/sync probing straightforward. Testing shows
> minimal impact on synchronous probe time.
> 3. If you prefer a PCI-only approach, we could add a flag in struct device_driver
> (default false) that PCI sets during registration. This limits the new path to
> PCI devices while others retain existing behavior. The extra code is ~10 lines
> and can be removed once confidence is established.
I am open to this option. One demonstration of how this conversion can
cause odd surprises is what it does to locking assumptions. For example,
I ran into the work_on_cpu(..., local_pci_probe...) behavior with some
of the work-in-progress confidential device work [1]. I was surprised
when lockdep_assert_held() returned false in a driver probe context.
I like that buses can opt-in to this behavior vs it being forced.
Similar to how async-behavior is handled as an opt-in.
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/devsec/tsm.git/tree/drivers/base/coco.c?h=staging#n86
> 4. I'm committed to supporting this: I'll include "Fixes:" tags for any fallout
> and provide patches within a month of any report. Since the logic mirrors the
> core async helper, risk should be low—but I'll take full responsibility
> regardless.
Sounds good.
With the above change you can add:
Acked-by: Dan Williams <dan.j.williams@intel.com>
...and I may carve out some time to upgrade that to Reviewed-by on the
next posting.