public inbox for linux-pci@vger.kernel.org
* [PATCH 0/3] Add NUMA-node-aware synchronous probing to driver core
@ 2026-01-07 17:55 Jinhui Guo
  2026-01-07 17:55 ` [PATCH 1/3] driver core: Introduce helper function __device_attach_driver_scan() Jinhui Guo
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Jinhui Guo @ 2026-01-07 17:55 UTC (permalink / raw)
  To: dakr, alexander.h.duyck, alexanderduyck, bhelgaas, bvanassche,
	dan.j.williams, gregkh, helgaas, rafael, tj
  Cc: guojinhui.liam, linux-kernel, linux-pci

Hi all,

** Overview **

This patchset introduces NUMA-node-aware synchronous probing.

Drivers can initialize and allocate memory on the device’s local
node without scattering kmalloc_node() calls throughout the code.
NUMA-aware probing was added to PCI drivers in 2005 and has
benefited them ever since.

The asynchronous probe path already supports NUMA-node-aware
probing via async_schedule_dev() in the driver core. Since NUMA
affinity is orthogonal to sync/async probing, this patchset adds
NUMA-node-aware support to the synchronous probe path.

** Background **

The idea arose from a discussion with Bjorn and Danilo about a
PCI-probe issue [1]:

When PCI devices on the same NUMA node are probed asynchronously,
pci_call_probe() calls work_on_cpu(), which pins every probe worker
to the same CPU inside that node and forces the probes to run serially.

Testing three NVMe devices on the same NUMA node of an AMD EPYC 9A64
2.4 GHz processor (all on CPU 0):

  nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, probe cost: 53372612 ns
  nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:2, probe cost: 49532941 ns
  nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:3, probe cost: 47315175 ns

Since the driver core already provides NUMA-node-aware asynchronous
probing, we can extend the same capability to the synchronous probe
path. This solves the issue and lets other drivers benefit from
NUMA-local initialization as well.

[1] https://lore.kernel.org/all/20251227113326.964-1-guojinhui.liam@bytedance.com/

** Changes **

The series makes three main changes:

1. Adds helper __device_attach_driver_scan() to eliminate duplication
   between __device_attach() and __device_attach_async_helper().
2. Introduces a NUMA-node-aware execution mechanism and uses it to
   enable NUMA-local synchronous probing in __device_attach(),
   device_driver_attach(), and __driver_attach().
3. Removes the now-redundant NUMA code from the PCI driver.

** Test **

I added debug prints to nvme, mlx5, usbhid, and intel_rapl_msr and
ran tests on an AMD EPYC 9A64 system:

1. Without the patchset
   - PCI drivers (nvme, mlx5) probe sequentially on CPU 0
   - USB and platform drivers pick random CPUs in the udev worker

   nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, cost: 54013202 ns
   nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:2, cost: 53968911 ns
   nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:4, cost: 48077276 ns
   
   mlx5_core 0000:41:00.0: CPU: 0, COMM: kworker/0:2 cost: 506256717 ns
   mlx5_core 0000:41:00.1: CPU: 0, COMM: kworker/0:2 cost: 514289394 ns
   
   usb 1-2.4: CPU: 163, COMM: (udev-worker), cost 854131 ns
   usb 1-2.6: CPU: 163, COMM: (udev-worker), cost 967993 ns
   
   intel_rapl_msr intel_rapl_msr.0: CPU: 61, COMM: (udev-worker), cost: 3717567 ns

2. With the patchset
   - PCI probes are spread across CPUs inside the device’s NUMA node
   - Asynchronous nvme probes are ~35% faster; synchronous mlx5 times
     are unchanged
   - USB probe times are virtually identical
   - Platform driver (no NUMA node) falls back to the original path

   nvme 0000:01:00.0: CPU: 130, COMM: kworker/u1025:0, cost: 35074561 ns
   nvme 0000:02:00.0: CPU:   1, COMM: kworker/u1025:6, cost: 34612117 ns
   nvme 0000:03:00.0: CPU:   2, COMM: kworker/u1025:5, cost: 34802918 ns

   mlx5_core 0000:41:00.0: CPU: 128, COMM: kworker/u1025:0, cost: 506214576 ns
   mlx5_core 0000:41:00.1: CPU: 128, COMM: kworker/u1025:0, cost: 514273565 ns

   usb 1-2.4: CPU: 51, COMM: kworker/u1031:2, cost: 933581 ns
   usb 1-2.6: CPU: 51, COMM: kworker/u1031:2, cost: 957237 ns

   intel_rapl_msr intel_rapl_msr.0: CPU: 225, COMM: (udev-worker), cost: 4715967 ns

3. With the patchset, unbind/bind cycles also spread PCI probes across
   CPUs within the device’s NUMA node:

   nvme 0000:02:00.0: CPU: 1, COMM: kworker/u1025:4, cost: 37070897 ns

** Final **

Comments and suggestions are welcome.

Best Regards,
Jinhui

---
Jinhui Guo (3):
  driver core: Introduce helper function __device_attach_driver_scan()
  driver core: Add NUMA-node awareness to the synchronous probe path
  PCI: Clean up NUMA-node awareness in pci_bus_type probe

 drivers/base/dd.c        | 173 +++++++++++++++++++++++++++++++--------
 drivers/pci/pci-driver.c |  83 ++-----------------
 include/linux/pci.h      |   1 -
 3 files changed, 148 insertions(+), 109 deletions(-)

-- 
2.20.1

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH 1/3] driver core: Introduce helper function __device_attach_driver_scan()
  2026-01-07 17:55 [PATCH 0/3] Add NUMA-node-aware synchronous probing to driver core Jinhui Guo
@ 2026-01-07 17:55 ` Jinhui Guo
  2026-01-17 13:36   ` Danilo Krummrich
  2026-01-07 17:55 ` [PATCH 2/3] driver core: Add NUMA-node awareness to the synchronous probe path Jinhui Guo
  2026-01-07 17:55 ` [PATCH 3/3] PCI: Clean up NUMA-node awareness in pci_bus_type probe Jinhui Guo
  2 siblings, 1 reply; 9+ messages in thread
From: Jinhui Guo @ 2026-01-07 17:55 UTC (permalink / raw)
  To: dakr, alexander.h.duyck, alexanderduyck, bhelgaas, bvanassche,
	dan.j.williams, gregkh, helgaas, rafael, tj
  Cc: guojinhui.liam, linux-kernel, linux-pci

Introduce a helper to eliminate duplication between
__device_attach() and __device_attach_async_helper();
a later patch will reuse it to add NUMA-node awareness
to the synchronous probe path in __device_attach().

No functional changes.

Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
---
 drivers/base/dd.c | 71 ++++++++++++++++++++++++++---------------------
 1 file changed, 40 insertions(+), 31 deletions(-)

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 349f31bedfa1..896f98add97d 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -962,6 +962,44 @@ static int __device_attach_driver(struct device_driver *drv, void *_data)
 	return ret == 0;
 }
 
+static int __device_attach_driver_scan(struct device_attach_data *data,
+				       bool *need_async)
+{
+	int ret = 0;
+	struct device *dev = data->dev;
+
+	if (dev->parent)
+		pm_runtime_get_sync(dev->parent);
+
+	ret = bus_for_each_drv(dev->bus, NULL, data,
+			       __device_attach_driver);
+	/*
+	 * need_async is NULL when we are called from the async
+	 * worker, since then we are already probing asynchronously.
+	 */
+	if (need_async && !ret && data->check_async && data->have_async) {
+		/*
+		 * If we could not find appropriate driver
+		 * synchronously and we are allowed to do
+		 * async probes and there are drivers that
+		 * want to probe asynchronously, we'll
+		 * try them.
+		 */
+		dev_dbg(dev, "scheduling asynchronous probe\n");
+		get_device(dev);
+		*need_async = true;
+	} else {
+		if (!need_async)
+			dev_dbg(dev, "async probe completed\n");
+		pm_request_idle(dev);
+	}
+
+	if (dev->parent)
+		pm_runtime_put(dev->parent);
+
+	return ret;
+}
+
 static void __device_attach_async_helper(void *_dev, async_cookie_t cookie)
 {
 	struct device *dev = _dev;
@@ -982,16 +1020,8 @@ static void __device_attach_async_helper(void *_dev, async_cookie_t cookie)
 	if (dev->p->dead || dev->driver)
 		goto out_unlock;
 
-	if (dev->parent)
-		pm_runtime_get_sync(dev->parent);
+	__device_attach_driver_scan(&data, NULL);
 
-	bus_for_each_drv(dev->bus, NULL, &data, __device_attach_driver);
-	dev_dbg(dev, "async probe completed\n");
-
-	pm_request_idle(dev);
-
-	if (dev->parent)
-		pm_runtime_put(dev->parent);
 out_unlock:
 	device_unlock(dev);
 
@@ -1025,28 +1055,7 @@ static int __device_attach(struct device *dev, bool allow_async)
 			.want_async = false,
 		};
 
-		if (dev->parent)
-			pm_runtime_get_sync(dev->parent);
-
-		ret = bus_for_each_drv(dev->bus, NULL, &data,
-					__device_attach_driver);
-		if (!ret && allow_async && data.have_async) {
-			/*
-			 * If we could not find appropriate driver
-			 * synchronously and we are allowed to do
-			 * async probes and there are drivers that
-			 * want to probe asynchronously, we'll
-			 * try them.
-			 */
-			dev_dbg(dev, "scheduling asynchronous probe\n");
-			get_device(dev);
-			async = true;
-		} else {
-			pm_request_idle(dev);
-		}
-
-		if (dev->parent)
-			pm_runtime_put(dev->parent);
+		ret = __device_attach_driver_scan(&data, &async);
 	}
 out_unlock:
 	device_unlock(dev);
-- 
2.20.1


* [PATCH 2/3] driver core: Add NUMA-node awareness to the synchronous probe path
  2026-01-07 17:55 [PATCH 0/3] Add NUMA-node-aware synchronous probing to driver core Jinhui Guo
  2026-01-07 17:55 ` [PATCH 1/3] driver core: Introduce helper function __device_attach_driver_scan() Jinhui Guo
@ 2026-01-07 17:55 ` Jinhui Guo
  2026-01-07 18:22   ` Danilo Krummrich
  2026-01-17 14:03   ` Danilo Krummrich
  2026-01-07 17:55 ` [PATCH 3/3] PCI: Clean up NUMA-node awareness in pci_bus_type probe Jinhui Guo
  2 siblings, 2 replies; 9+ messages in thread
From: Jinhui Guo @ 2026-01-07 17:55 UTC (permalink / raw)
  To: dakr, alexander.h.duyck, alexanderduyck, bhelgaas, bvanassche,
	dan.j.williams, gregkh, helgaas, rafael, tj
  Cc: guojinhui.liam, linux-kernel, linux-pci

Introduce NUMA-node-aware synchronous probing: drivers
can initialize and allocate memory on the device’s local
node without scattering kmalloc_node() calls throughout
the code.

NUMA-aware probing was first added to PCI drivers by
commit d42c69972b85 ("[PATCH] PCI: Run PCI driver
initialization on local node") in 2005 and has benefited
PCI drivers ever since.

The asynchronous probe path already supports NUMA-node-aware
probing via async_schedule_dev() in the driver core. Since
NUMA affinity is orthogonal to sync/async probing, this
patch adds NUMA-node-aware support to the synchronous
probe path.

Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
---
 drivers/base/dd.c | 104 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 101 insertions(+), 3 deletions(-)

diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 896f98add97d..e1fb10ae2cc0 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -381,6 +381,92 @@ static void __exit deferred_probe_exit(void)
 }
 __exitcall(deferred_probe_exit);
 
+/*
+ * NUMA-node-aware synchronous probing:
+ * drivers can initialize and allocate memory on the device’s local
+ * node without scattering kmalloc_node() calls throughout the code.
+ */
+
+/* Generic function pointer type */
+typedef int (*numa_func_t)(void *arg1, void *arg2);
+
+/* Context for NUMA execution */
+struct numa_work_ctx {
+	struct work_struct work;
+	numa_func_t func;
+	void *arg1;
+	void *arg2;
+	int result;
+};
+
+/* Worker function running on the target node */
+static void numa_work_func(struct work_struct *work)
+{
+	struct numa_work_ctx *ctx = container_of(work, struct numa_work_ctx, work);
+
+	ctx->result = ctx->func(ctx->arg1, ctx->arg2);
+}
+
+/*
+ * __exec_on_numa_node - Execute a function on a specific NUMA node synchronously
+ * @node: Target NUMA node ID
+ * @func: The wrapper function to execute
+ * @arg1: First argument (void *)
+ * @arg2: Second argument (void *)
+ *
+ * Returns the result of the function execution, or -ENODEV if initialization fails.
+ * If the node is invalid or offline, it falls back to local execution.
+ */
+static int __exec_on_numa_node(int node, numa_func_t func, void *arg1, void *arg2)
+{
+	struct numa_work_ctx ctx;
+
+	/* Fallback to local execution if the node is invalid or offline */
+	if (node < 0 || node >= MAX_NUMNODES || !node_online(node))
+		return func(arg1, arg2);
+
+	ctx.func = func;
+	ctx.arg1 = arg1;
+	ctx.arg2 = arg2;
+	ctx.result = -ENODEV;
+	INIT_WORK_ONSTACK(&ctx.work, numa_work_func);
+
+	/* Use system_dfl_wq to allow execution on the specific node. */
+	queue_work_node(node, system_dfl_wq, &ctx.work);
+	flush_work(&ctx.work);
+	destroy_work_on_stack(&ctx.work);
+
+	return ctx.result;
+}
+
+/*
+ * DEFINE_NUMA_WRAPPER - Generate a type-safe wrapper for a function
+ * @func_name: The name of the target function
+ * @type1: The type of the first argument
+ * @type2: The type of the second argument
+ *
+ * This macro generates a static function named __wrapper_<func_name> that
+ * casts void pointers back to their original types and calls the target function.
+ */
+#define DEFINE_NUMA_WRAPPER(func_name, type1, type2)			\
+	static int __wrapper_##func_name(void *arg1, void *arg2)	\
+	{								\
+		return func_name((type1)arg1, (type2)arg2);		\
+	}
+
+/*
+ * EXEC_ON_NUMA_NODE - Execute a registered function on a NUMA node
+ * @node: Target NUMA node ID
+ * @func_name: The name of the target function (must be registered via DEFINE_NUMA_WRAPPER)
+ * @arg1: First argument
+ * @arg2: Second argument
+ *
+ * This macro invokes the internal execution helper using the generated wrapper.
+ */
+#define EXEC_ON_NUMA_NODE(node, func_name, arg1, arg2)		\
+	__exec_on_numa_node(node, __wrapper_##func_name,	\
+			(void *)(arg1), (void *)(arg2))
+
 /**
  * device_is_bound() - Check if device is bound to a driver
  * @dev: device to check
@@ -808,6 +894,8 @@ static int __driver_probe_device(const struct device_driver *drv, struct device
 	return ret;
 }
 
+DEFINE_NUMA_WRAPPER(__driver_probe_device, const struct device_driver *, struct device *)
+
 /**
  * driver_probe_device - attempt to bind device & driver together
  * @drv: driver to bind a device to
@@ -844,6 +932,8 @@ static int driver_probe_device(const struct device_driver *drv, struct device *d
 	return ret;
 }
 
+DEFINE_NUMA_WRAPPER(driver_probe_device, const struct device_driver *, struct device *)
+
 static inline bool cmdline_requested_async_probing(const char *drv_name)
 {
 	bool async_drv;
@@ -1000,6 +1090,8 @@ static int __device_attach_driver_scan(struct device_attach_data *data,
 	return ret;
 }
 
+DEFINE_NUMA_WRAPPER(__device_attach_driver_scan, struct device_attach_data *, bool *)
+
 static void __device_attach_async_helper(void *_dev, async_cookie_t cookie)
 {
 	struct device *dev = _dev;
@@ -1055,7 +1147,9 @@ static int __device_attach(struct device *dev, bool allow_async)
 			.want_async = false,
 		};
 
-		ret = __device_attach_driver_scan(&data, &async);
+		ret = EXEC_ON_NUMA_NODE(dev_to_node(dev),
+					__device_attach_driver_scan,
+					&data, &async);
 	}
 out_unlock:
 	device_unlock(dev);
@@ -1142,7 +1236,9 @@ int device_driver_attach(const struct device_driver *drv, struct device *dev)
 	int ret;
 
 	__device_driver_lock(dev, dev->parent);
-	ret = __driver_probe_device(drv, dev);
+	ret = EXEC_ON_NUMA_NODE(dev_to_node(dev),
+				__driver_probe_device,
+				drv, dev);
 	__device_driver_unlock(dev, dev->parent);
 
 	/* also return probe errors as normal negative errnos */
@@ -1231,7 +1327,9 @@ static int __driver_attach(struct device *dev, void *data)
 	}
 
 	__device_driver_lock(dev, dev->parent);
-	driver_probe_device(drv, dev);
+	EXEC_ON_NUMA_NODE(dev_to_node(dev),
+			  driver_probe_device,
+			  drv, dev);
 	__device_driver_unlock(dev, dev->parent);
 
 	return 0;
-- 
2.20.1


* [PATCH 3/3] PCI: Clean up NUMA-node awareness in pci_bus_type probe
  2026-01-07 17:55 [PATCH 0/3] Add NUMA-node-aware synchronous probing to driver core Jinhui Guo
  2026-01-07 17:55 ` [PATCH 1/3] driver core: Introduce helper function __device_attach_driver_scan() Jinhui Guo
  2026-01-07 17:55 ` [PATCH 2/3] driver core: Add NUMA-node awareness to the synchronous probe path Jinhui Guo
@ 2026-01-07 17:55 ` Jinhui Guo
  2 siblings, 0 replies; 9+ messages in thread
From: Jinhui Guo @ 2026-01-07 17:55 UTC (permalink / raw)
  To: dakr, alexander.h.duyck, alexanderduyck, bhelgaas, bvanassche,
	dan.j.williams, gregkh, helgaas, rafael, tj
  Cc: guojinhui.liam, linux-kernel, linux-pci

With NUMA-node-aware probing now handled by the driver core,
the equivalent code in the PCI driver is redundant and can
be removed.

Dropping it speeds up asynchronous probing by about 35%; the gain
comes from eliminating the work_on_cpu() call in pci_call_probe(),
which previously pinned every probe worker to the same CPU and
forced devices on the same NUMA node to probe serially.

Testing three NVMe devices on the same NUMA node of an AMD
EPYC 9A64 2.4 GHz processor shows a 35% probe-time improvement
with the patch:

Before (all on CPU 0):
  nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, cost: 52266334ns
  nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:0, cost: 50787194ns
  nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:2, cost: 50541584ns

After (spread across CPUs 1, 2, 4):
  nvme 0000:01:00.0: CPU: 1, COMM: kworker/u1025:2, cost: 35399608ns
  nvme 0000:02:00.0: CPU: 2, COMM: kworker/u1025:3, cost: 35156157ns
  nvme 0000:03:00.0: CPU: 4, COMM: kworker/u1025:0, cost: 35322116ns

The improvement grows with more PCI devices because fewer probes
contend for the same CPU.

Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
---
 drivers/pci/pci-driver.c | 83 ++++------------------------------------
 include/linux/pci.h      |  1 -
 2 files changed, 8 insertions(+), 76 deletions(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 7c2d9d596258..683bc682e750 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -296,18 +296,9 @@ static struct attribute *pci_drv_attrs[] = {
 };
 ATTRIBUTE_GROUPS(pci_drv);
 
-struct drv_dev_and_id {
-	struct pci_driver *drv;
-	struct pci_dev *dev;
-	const struct pci_device_id *id;
-};
-
-static long local_pci_probe(void *_ddi)
+static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
+			  const struct pci_device_id *id)
 {
-	struct drv_dev_and_id *ddi = _ddi;
-	struct pci_dev *pci_dev = ddi->dev;
-	struct pci_driver *pci_drv = ddi->drv;
-	struct device *dev = &pci_dev->dev;
 	int rc;
 
 	/*
@@ -319,83 +310,25 @@ static long local_pci_probe(void *_ddi)
 	 * count, in its probe routine and pm_runtime_get_noresume() in
 	 * its remove routine.
 	 */
-	pm_runtime_get_sync(dev);
-	pci_dev->driver = pci_drv;
-	rc = pci_drv->probe(pci_dev, ddi->id);
+	pm_runtime_get_sync(&dev->dev);
+	dev->driver = drv;
+	rc = drv->probe(dev, id);
 	if (!rc)
 		return rc;
 	if (rc < 0) {
-		pci_dev->driver = NULL;
-		pm_runtime_put_sync(dev);
+		dev->driver = NULL;
+		pm_runtime_put_sync(&dev->dev);
 		return rc;
 	}
 	/*
 	 * Probe function should return < 0 for failure, 0 for success
 	 * Treat values > 0 as success, but warn.
 	 */
-	pci_warn(pci_dev, "Driver probe function unexpectedly returned %d\n",
+	pci_warn(dev, "Driver probe function unexpectedly returned %d\n",
 		 rc);
 	return 0;
 }
 
-static bool pci_physfn_is_probed(struct pci_dev *dev)
-{
-#ifdef CONFIG_PCI_IOV
-	return dev->is_virtfn && dev->physfn->is_probed;
-#else
-	return false;
-#endif
-}
-
-static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
-			  const struct pci_device_id *id)
-{
-	int error, node, cpu;
-	struct drv_dev_and_id ddi = { drv, dev, id };
-
-	/*
-	 * Execute driver initialization on node where the device is
-	 * attached.  This way the driver likely allocates its local memory
-	 * on the right node.
-	 */
-	node = dev_to_node(&dev->dev);
-	dev->is_probed = 1;
-
-	cpu_hotplug_disable();
-
-	/*
-	 * Prevent nesting work_on_cpu() for the case where a Virtual Function
-	 * device is probed from work_on_cpu() of the Physical device.
-	 */
-	if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
-	    pci_physfn_is_probed(dev)) {
-		cpu = nr_cpu_ids;
-	} else {
-		cpumask_var_t wq_domain_mask;
-
-		if (!zalloc_cpumask_var(&wq_domain_mask, GFP_KERNEL)) {
-			error = -ENOMEM;
-			goto out;
-		}
-		cpumask_and(wq_domain_mask,
-			    housekeeping_cpumask(HK_TYPE_WQ),
-			    housekeeping_cpumask(HK_TYPE_DOMAIN));
-
-		cpu = cpumask_any_and(cpumask_of_node(node),
-				      wq_domain_mask);
-		free_cpumask_var(wq_domain_mask);
-	}
-
-	if (cpu < nr_cpu_ids)
-		error = work_on_cpu(cpu, local_pci_probe, &ddi);
-	else
-		error = local_pci_probe(&ddi);
-out:
-	dev->is_probed = 0;
-	cpu_hotplug_enable();
-	return error;
-}
-
 /**
  * __pci_device_probe - check if a driver wants to claim a specific PCI device
  * @drv: driver to call to check if it wants the PCI device
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 864775651c6f..cbc0db2f2b84 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -481,7 +481,6 @@ struct pci_dev {
 	unsigned int	io_window_1k:1;		/* Intel bridge 1K I/O windows */
 	unsigned int	irq_managed:1;
 	unsigned int	non_compliant_bars:1;	/* Broken BARs; ignore them */
-	unsigned int	is_probed:1;		/* Device probing in progress */
 	unsigned int	link_active_reporting:1;/* Device capable of reporting link active */
 	unsigned int	no_vf_scan:1;		/* Don't scan for VFs after IOV enablement */
 	unsigned int	no_command_memory:1;	/* No PCI_COMMAND_MEMORY */
-- 
2.20.1


* Re: [PATCH 2/3] driver core: Add NUMA-node awareness to the synchronous probe path
  2026-01-07 17:55 ` [PATCH 2/3] driver core: Add NUMA-node awareness to the synchronous probe path Jinhui Guo
@ 2026-01-07 18:22   ` Danilo Krummrich
  2026-01-08  8:28     ` Jinhui Guo
  2026-01-17 14:03   ` Danilo Krummrich
  1 sibling, 1 reply; 9+ messages in thread
From: Danilo Krummrich @ 2026-01-07 18:22 UTC (permalink / raw)
  To: Jinhui Guo
  Cc: alexander.h.duyck, alexanderduyck, bhelgaas, bvanassche,
	dan.j.williams, gregkh, helgaas, rafael, tj, linux-kernel,
	linux-pci

On Wed Jan 7, 2026 at 6:55 PM CET, Jinhui Guo wrote:
> + * __exec_on_numa_node - Execute a function on a specific NUMA node synchronously
> + * @node: Target NUMA node ID
> + * @func: The wrapper function to execute
> + * @arg1: First argument (void *)
> + * @arg2: Second argument (void *)
> + *
> + * Returns the result of the function execution, or -ENODEV if initialization fails.
> + * If the node is invalid or offline, it falls back to local execution.
> + */
> +static int __exec_on_numa_node(int node, numa_func_t func, void *arg1, void *arg2)
> +{
> +	struct numa_work_ctx ctx;
> +
> +	/* Fallback to local execution if the node is invalid or offline */
> +	if (node < 0 || node >= MAX_NUMNODES || !node_online(node))
> +		return func(arg1, arg2);

Just a quick drive-by comment (I’ll go through it more thoroughly later).

What about the case where we are already on the requested node?

Also, we should probably set the corresponding CPU affinity for the time we are
executing func() to prevent migration.

> +
> +	ctx.func = func;
> +	ctx.arg1 = arg1;
> +	ctx.arg2 = arg2;
> +	ctx.result = -ENODEV;
> +	INIT_WORK_ONSTACK(&ctx.work, numa_work_func);
> +
> +	/* Use system_dfl_wq to allow execution on the specific node. */
> +	queue_work_node(node, system_dfl_wq, &ctx.work);
> +	flush_work(&ctx.work);
> +	destroy_work_on_stack(&ctx.work);
> +
> +	return ctx.result;
> +}


* Re: [PATCH 2/3] driver core: Add NUMA-node awareness to the synchronous probe path
  2026-01-07 18:22   ` Danilo Krummrich
@ 2026-01-08  8:28     ` Jinhui Guo
  0 siblings, 0 replies; 9+ messages in thread
From: Jinhui Guo @ 2026-01-08  8:28 UTC (permalink / raw)
  To: dakr
  Cc: alexander.h.duyck, alexanderduyck, bhelgaas, bvanassche,
	dan.j.williams, gregkh, guojinhui.liam, helgaas, linux-kernel,
	linux-pci, rafael, tj

On Wed Jan 07, 2026 at 19:22:15 +0100, Danilo Krummrich wrote:
> On Wed Jan 7, 2026 at 6:55 PM CET, Jinhui Guo wrote:
> > + * __exec_on_numa_node - Execute a function on a specific NUMA node synchronously
> > + * @node: Target NUMA node ID
> > + * @func: The wrapper function to execute
> > + * @arg1: First argument (void *)
> > + * @arg2: Second argument (void *)
> > + *
> > + * Returns the result of the function execution, or -ENODEV if initialization fails.
> > + * If the node is invalid or offline, it falls back to local execution.
> > + */
> > +static int __exec_on_numa_node(int node, numa_func_t func, void *arg1, void *arg2)
> > +{
> > +	struct numa_work_ctx ctx;
> > +
> > +	/* Fallback to local execution if the node is invalid or offline */
> > +	if (node < 0 || node >= MAX_NUMNODES || !node_online(node))
> > +		return func(arg1, arg2);
> 
> Just a quick drive-by comment (I’ll go through it more thoroughly later).
> 
> What about the case where we are already on the requested node?
> 
> Also, we should probably set the corresponding CPU affinity for the time we are
> executing func() to prevent migration.

Hi Danilo,

Thank you for your time and helpful comments.

Relying on queue_work_node() for node affinity is safer, even if the thread
is already on the target CPU.

Checking the current CPU and then setting affinity ourselves would require
handling CPU hotplug and isolated CPUs, corner cases that quickly become
complex.

The PCI driver tried this years ago and ran into numerous problems; delegating
the decision to queue_work_node() avoids repeating that history.

- Commit d42c69972b85 ("[PATCH] PCI: Run PCI driver initialization on local node")
  first added NUMA awareness with set_cpus_allowed_ptr().
- Commit 1ddd45f8d76f ("PCI: Use cpu_hotplug_disable() instead of get_online_cpus()")
  handled CPU-hotplug.
- Commits 69a18b18699b ("PCI: Restrict probe functions to housekeeping CPUs") and
  9d42ea0d6984 ("pci: Decouple HK_FLAG_WQ and HK_FLAG_DOMAIN cpumask fetch") dealt
  with isolated CPUs.

I considered setting CPU affinity, but the performance gain is minimal:

1. Driver probing happens mainly at boot, when load is light, so queuing a worker
   incurs little delay.
2. With many devices they are usually spread across nodes, so workers are not
   stalled long within any NUMA node.
3. Even after pinning, tasks can still be migrated by load balancing within the
   NUMA node, so the reduction in context switches versus using queue_work_node()
   alone is negligible.

Test data [1] shows that queue_work_node() has negligible impact on synchronous probe time.

[1] https://lore.kernel.org/all/20260107175548.1792-1-guojinhui.liam@bytedance.com/

If you have any other concerns, please let me know.

Best Regards,
Jinhui


* Re: [PATCH 1/3] driver core: Introduce helper function __device_attach_driver_scan()
  2026-01-07 17:55 ` [PATCH 1/3] driver core: Introduce helper function __device_attach_driver_scan() Jinhui Guo
@ 2026-01-17 13:36   ` Danilo Krummrich
  0 siblings, 0 replies; 9+ messages in thread
From: Danilo Krummrich @ 2026-01-17 13:36 UTC (permalink / raw)
  To: Jinhui Guo
  Cc: alexander.h.duyck, alexanderduyck, bhelgaas, bvanassche,
	dan.j.williams, gregkh, helgaas, rafael, tj, linux-kernel,
	linux-pci

On Wed Jan 7, 2026 at 6:55 PM CET, Jinhui Guo wrote:
> Introduce a helper to eliminate duplication between
> __device_attach() and __device_attach_async_helper();
> a later patch will reuse it to add NUMA-node awareness
> to the synchronous probe path in __device_attach().
>
> No functional changes.
>
> Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>

Reviewed-by: Danilo Krummrich <dakr@kernel.org>


* Re: [PATCH 2/3] driver core: Add NUMA-node awareness to the synchronous probe path
  2026-01-07 17:55 ` [PATCH 2/3] driver core: Add NUMA-node awareness to the synchronous probe path Jinhui Guo
  2026-01-07 18:22   ` Danilo Krummrich
@ 2026-01-17 14:03   ` Danilo Krummrich
  2026-01-20 17:23     ` Jinhui Guo
  1 sibling, 1 reply; 9+ messages in thread
From: Danilo Krummrich @ 2026-01-17 14:03 UTC (permalink / raw)
  To: Jinhui Guo
  Cc: alexander.h.duyck, alexanderduyck, bhelgaas, bvanassche,
	dan.j.williams, gregkh, helgaas, rafael, tj, linux-kernel,
	linux-pci

On Wed Jan 7, 2026 at 6:55 PM CET, Jinhui Guo wrote:
> @@ -808,6 +894,8 @@ static int __driver_probe_device(const struct device_driver *drv, struct device
>  	return ret;
>  }
>  
> +DEFINE_NUMA_WRAPPER(__driver_probe_device, const struct device_driver *, struct device *)
> +
>  /**
>   * driver_probe_device - attempt to bind device & driver together
>   * @drv: driver to bind a device to
> @@ -844,6 +932,8 @@ static int driver_probe_device(const struct device_driver *drv, struct device *d
>  	return ret;
>  }
>  
> +DEFINE_NUMA_WRAPPER(driver_probe_device, const struct device_driver *, struct device *)
> +
>  static inline bool cmdline_requested_async_probing(const char *drv_name)
>  {
>  	bool async_drv;
> @@ -1000,6 +1090,8 @@ static int __device_attach_driver_scan(struct device_attach_data *data,
>  	return ret;
>  }
>  
> +DEFINE_NUMA_WRAPPER(__device_attach_driver_scan, struct device_attach_data *, bool *)

Why define three different wrappers? To me it looks like we should easily get
away with a single wrapper for __driver_probe_device(), which could just be
__driver_probe_device_node().


__device_attach_driver_scan() already has this information (i.e. we can check if
need_async == NULL). Additionally, we can change the signature of
driver_probe_device() to

	static int driver_probe_device(const struct device_driver *drv, struct device *dev, bool async)

This reduces complexity a lot, since it gets rid of the DEFINE_NUMA_WRAPPER()
and EXEC_ON_NUMA_NODE() macros.

>  static void __device_attach_async_helper(void *_dev, async_cookie_t cookie)
>  {
>  	struct device *dev = _dev;
> @@ -1055,7 +1147,9 @@ static int __device_attach(struct device *dev, bool allow_async)
>  			.want_async = false,
>  		};
>  
> -		ret = __device_attach_driver_scan(&data, &async);
> +		ret = EXEC_ON_NUMA_NODE(dev_to_node(dev),
> +					__device_attach_driver_scan,
> +					&data, &async);
>  	}
>  out_unlock:
>  	device_unlock(dev);
> @@ -1142,7 +1236,9 @@ int device_driver_attach(const struct device_driver *drv, struct device *dev)
>  	int ret;
>  
>  	__device_driver_lock(dev, dev->parent);
> -	ret = __driver_probe_device(drv, dev);
> +	ret = EXEC_ON_NUMA_NODE(dev_to_node(dev),
> +				__driver_probe_device,
> +				drv, dev);
>  	__device_driver_unlock(dev, dev->parent);
>  
>  	/* also return probe errors as normal negative errnos */
> @@ -1231,7 +1327,9 @@ static int __driver_attach(struct device *dev, void *data)
>  	}
>  
>  	__device_driver_lock(dev, dev->parent);
> -	driver_probe_device(drv, dev);
> +	EXEC_ON_NUMA_NODE(dev_to_node(dev),
> +			  driver_probe_device,
> +			  drv, dev);
>  	__device_driver_unlock(dev, dev->parent);
>  
>  	return 0;
> -- 
> 2.20.1



* Re: [PATCH 2/3] driver core: Add NUMA-node awareness to the synchronous probe path
  2026-01-17 14:03   ` Danilo Krummrich
@ 2026-01-20 17:23     ` Jinhui Guo
  0 siblings, 0 replies; 9+ messages in thread
From: Jinhui Guo @ 2026-01-20 17:23 UTC (permalink / raw)
  To: dakr
  Cc: alexander.h.duyck, alexanderduyck, bhelgaas, bvanassche,
	dan.j.williams, gregkh, guojinhui.liam, helgaas, linux-kernel,
	linux-pci, rafael, tj

On Sat Jan 17, 2026 15:03:08 +0100, Danilo Krummrich wrote:
> On Wed Jan 7, 2026 at 6:55 PM CET, Jinhui Guo wrote:
> > @@ -808,6 +894,8 @@ static int __driver_probe_device(const struct device_driver *drv, struct device
> >  	return ret;
> >  }
> >  
> > +DEFINE_NUMA_WRAPPER(__driver_probe_device, const struct device_driver *, struct device *)
> > +
> >  /**
> >   * driver_probe_device - attempt to bind device & driver together
> >   * @drv: driver to bind a device to
> > @@ -844,6 +932,8 @@ static int driver_probe_device(const struct device_driver *drv, struct device *d
> >  	return ret;
> >  }
> >  
> > +DEFINE_NUMA_WRAPPER(driver_probe_device, const struct device_driver *, struct device *)
> > +
> >  static inline bool cmdline_requested_async_probing(const char *drv_name)
> >  {
> >  	bool async_drv;
> > @@ -1000,6 +1090,8 @@ static int __device_attach_driver_scan(struct device_attach_data *data,
> >  	return ret;
> >  }
> >  
> > +DEFINE_NUMA_WRAPPER(__device_attach_driver_scan, struct device_attach_data *, bool *)
> 
> Why define three different wrappers? To me it looks like we should easily get
> away with a single wrapper for __driver_probe_device(), which could just be
> __driver_probe_device_node().
> 
> 
> __device_attach_driver_scan() already has this information (i.e. we can check if
> need_async == NULL). Additionally, we can change the signature of
> driver_probe_device() to
> 
> 	static int driver_probe_device(const struct device_driver *drv, struct device *dev, bool async)
> 
> This reduces complexity a lot, since it gets us rid of DEFINE_NUMA_WRAPPER() and
> EXEC_ON_NUMA_NODE() macros.

Hi Danilo,

Thank you for your time and helpful comments.

Apologies for the delayed reply. I understand your concern: before sending this
patchset I prototyped a version that added __driver_probe_device_node() and
relied solely on current_is_async() to detect an async worker, without changing
driver_probe_device()’s signature. That proved fragile, so I abandoned it; your
suggestion is the more reliable path forward.

I’ve spent the last couple of days preparing a new patch and will send it out
after testing.

Best Regards,
Jinhui

