public inbox for stable@vger.kernel.org
* [PATCH] PCI: Avoid work_on_cpu() in async probe workers
@ 2025-12-27 11:33 Jinhui Guo
  2025-12-29 17:20 ` Bjorn Helgaas
  2025-12-29 18:08 ` Tejun Heo
  0 siblings, 2 replies; 9+ messages in thread
From: Jinhui Guo @ 2025-12-27 11:33 UTC (permalink / raw)
  To: bhelgaas, bvanassche, dan.j.williams, alexander.h.duyck, gregkh
  Cc: guojinhui.liam, linux-pci, linux-kernel, stable

Commit ef0ff68351be ("driver core: Probe devices asynchronously instead of
the driver") speeds up the loading of large numbers of device drivers by
submitting asynchronous probe workers to an unbound workqueue, queueing
each worker near the device's NUMA node. These workers are never scheduled
on isolated CPUs because their cpumask is restricted to
housekeeping_cpumask(HK_TYPE_WQ) and housekeeping_cpumask(HK_TYPE_DOMAIN).

However, when PCI devices reside on the same NUMA node, all their
drivers' probe workers end up bound to the same CPU within that node, so
the probes serialize instead of running in parallel, because
pci_call_probe() invokes work_on_cpu(). Introduced by commit 873392ca514f
("PCI: work_on_cpu: use in drivers/pci/pci-driver.c"), work_on_cpu()
queues a worker on system_percpu_wq, binding the probe thread to the first
CPU in the device's NUMA node (chosen via cpumask_any_and() in
pci_call_probe()).

1. The function __driver_attach() submits an asynchronous worker with
   callback __driver_attach_async_helper().

   __driver_attach()
    async_schedule_dev(__driver_attach_async_helper, dev)
     async_schedule_node(func, dev, dev_to_node(dev))
      async_schedule_node_domain(func, data, node, &async_dfl_domain)
       __async_schedule_node_domain(func, data, node, domain, entry)
        queue_work_node(node, async_wq, &entry->work)

2. The asynchronous probe worker ultimately calls work_on_cpu() in
   pci_call_probe(), binding the worker to the same CPU within the
   device’s NUMA node.

   __driver_attach_async_helper()
    driver_probe_device(drv, dev)
     __driver_probe_device(drv, dev)
      really_probe(dev, drv)
       call_driver_probe(dev, drv)
        dev->bus->probe(dev)
         pci_device_probe(dev)
          __pci_device_probe(drv, pci_dev)
           pci_call_probe(drv, pci_dev, id)
            cpu = cpumask_any_and(cpumask_of_node(node), wq_domain_mask)
            error = work_on_cpu(cpu, local_pci_probe, &ddi)
             schedule_work_on(cpu, &wfc.work);
              queue_work_on(cpu, system_percpu_wq, work)

To fix the issue, pci_call_probe() must not call work_on_cpu() when it is
already running inside an unbound asynchronous probe worker. Because a
driver can be probed asynchronously either via its probe_type or via the
kernel command line, we cannot rely on PROBE_PREFER_ASYNCHRONOUS alone.
Instead, test the PF_WQ_WORKER flag in current->flags: if it is set,
pci_call_probe() is already executing in a workqueue worker (here, the
unbound async probe worker) and should skip the extra work_on_cpu() call.

Testing three NVMe devices on the same NUMA node of an AMD EPYC 9A64
2.4 GHz processor shows a 35% probe-time improvement with the patch:

Before (all on CPU 0):
  nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, probe cost: 53372612 ns
  nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:2, probe cost: 49532941 ns
  nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:3, probe cost: 47315175 ns

After (spread across CPUs 1, 2, 5):
  nvme 0000:01:00.0: CPU: 5, COMM: kworker/u1025:5, probe cost: 34765890 ns
  nvme 0000:02:00.0: CPU: 1, COMM: kworker/u1025:2, probe cost: 34696433 ns
  nvme 0000:03:00.0: CPU: 2, COMM: kworker/u1025:3, probe cost: 33233323 ns

The improvement grows with more PCI devices because fewer probes contend
for the same CPU.

Fixes: ef0ff68351be ("driver core: Probe devices asynchronously instead of the driver")
Cc: stable@vger.kernel.org
Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
---
 drivers/pci/pci-driver.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
index 7c2d9d596258..4bc47a84d330 100644
--- a/drivers/pci/pci-driver.c
+++ b/drivers/pci/pci-driver.c
@@ -366,9 +366,11 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
 	/*
 	 * Prevent nesting work_on_cpu() for the case where a Virtual Function
 	 * device is probed from work_on_cpu() of the Physical device.
+	 * Also check PF_WQ_WORKER to avoid invoking work_on_cpu() from an
+	 * asynchronous probe worker when the driver allows async probing.
 	 */
 	if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
-	    pci_physfn_is_probed(dev)) {
+	    pci_physfn_is_probed(dev) || (current->flags & PF_WQ_WORKER)) {
 		cpu = nr_cpu_ids;
 	} else {
 		cpumask_var_t wq_domain_mask;
-- 
2.20.1

Thread overview: 9+ messages
2025-12-27 11:33 [PATCH] PCI: Avoid work_on_cpu() in async probe workers Jinhui Guo
2025-12-29 17:20 ` Bjorn Helgaas
2025-12-29 18:08 ` Tejun Heo
2025-12-30 14:27   ` Jinhui Guo
2025-12-30 14:44     ` Jinhui Guo
2025-12-30 21:52     ` Bjorn Helgaas
2025-12-31  7:51       ` Jinhui Guo
2025-12-31 16:55         ` Bjorn Helgaas
2026-01-04 16:01           ` Danilo Krummrich
