From: Bjorn Helgaas <helgaas@kernel.org>
To: Jinhui Guo <guojinhui.liam@bytedance.com>
Cc: bhelgaas@google.com, bvanassche@acm.org,
dan.j.williams@intel.com, alexander.h.duyck@linux.intel.com,
gregkh@linuxfoundation.org, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, stable@vger.kernel.org,
Marco Crivellari <marco.crivellari@suse.com>,
Tejun Heo <tj@kernel.org>
Subject: Re: [PATCH] PCI: Avoid work_on_cpu() in async probe workers
Date: Mon, 29 Dec 2025 11:20:00 -0600
Message-ID: <20251229172000.GA68570@bhelgaas>
In-Reply-To: <20251227113326.964-1-guojinhui.liam@bytedance.com>
[+cc Marco, Tejun; just FYI since you have ongoing per-CPU wq work]
On Sat, Dec 27, 2025 at 07:33:26PM +0800, Jinhui Guo wrote:
> Commit ef0ff68351be ("driver core: Probe devices asynchronously instead of
> the driver") speeds up the loading of large numbers of device drivers by
> submitting asynchronous probe workers to an unbound workqueue and running
> each worker on a CPU near the device’s NUMA node. These workers are not
> scheduled on isolated CPUs because their cpumask is restricted to
> housekeeping_cpumask(HK_TYPE_WQ) and housekeeping_cpumask(HK_TYPE_DOMAIN).
>
> However, when several PCI devices reside on the same NUMA node, their
> probes end up contending for a single CPU instead of running in parallel,
> because pci_call_probe() invokes work_on_cpu(), which hands every probe to
> a per-CPU worker on the same CPU in that node. The work_on_cpu() call,
> introduced by commit 873392ca514f ("PCI: work_on_cpu: use in
> drivers/pci/pci-driver.c"), queues a work item on system_percpu_wq so that
> the probe runs on the first CPU in the device’s NUMA node (chosen via
> cpumask_any_and() in pci_call_probe()).
>
> 1. The function __driver_attach() submits an asynchronous worker with
> callback __driver_attach_async_helper().
>
> __driver_attach()
> async_schedule_dev(__driver_attach_async_helper, dev)
> async_schedule_node(func, dev, dev_to_node(dev))
> async_schedule_node_domain(func, data, node, &async_dfl_domain)
> __async_schedule_node_domain(func, data, node, domain, entry)
> queue_work_node(node, async_wq, &entry->work)
>
> 2. The asynchronous probe worker ultimately calls work_on_cpu() in
> pci_call_probe(), which hands the probe to a per-CPU worker on the same
> CPU within the device’s NUMA node.
>
> __driver_attach_async_helper()
> driver_probe_device(drv, dev)
> __driver_probe_device(drv, dev)
> really_probe(dev, drv)
> call_driver_probe(dev, drv)
> dev->bus->probe(dev)
> pci_device_probe(dev)
> __pci_device_probe(drv, pci_dev)
> pci_call_probe(drv, pci_dev, id)
> cpu = cpumask_any_and(cpumask_of_node(node), wq_domain_mask)
> error = work_on_cpu(cpu, local_pci_probe, &ddi)
> schedule_work_on(cpu, &wfc.work)
> queue_work_on(cpu, system_percpu_wq, work)
>
> To fix the issue, pci_call_probe() must not call work_on_cpu() when it is
> already running inside an asynchronous probe worker on the unbound
> async_wq. Because a driver can be probed asynchronously either via its
> probe_type or via the driver_async_probe= kernel command-line parameter,
> we cannot rely on PROBE_PREFER_ASYNCHRONOUS alone. Instead, we test the
> PF_WQ_WORKER flag in current->flags; if it is set, pci_call_probe() is
> already executing in a workqueue worker (the flag is set for every
> kworker, including the async probe workers) and should skip the extra
> work_on_cpu() call.
>
> Testing three NVMe devices on the same NUMA node of an AMD EPYC 9A64
> (2.4 GHz) processor shows a probe-time improvement of up to 35% with the
> patch:
>
> Before (all on CPU 0):
> nvme 0000:01:00.0: CPU: 0, COMM: kworker/0:1, probe cost: 53372612 ns
> nvme 0000:02:00.0: CPU: 0, COMM: kworker/0:2, probe cost: 49532941 ns
> nvme 0000:03:00.0: CPU: 0, COMM: kworker/0:3, probe cost: 47315175 ns
>
> After (spread across CPUs 1, 2, 5):
> nvme 0000:01:00.0: CPU: 5, COMM: kworker/u1025:5, probe cost: 34765890 ns
> nvme 0000:02:00.0: CPU: 1, COMM: kworker/u1025:2, probe cost: 34696433 ns
> nvme 0000:03:00.0: CPU: 2, COMM: kworker/u1025:3, probe cost: 33233323 ns
>
> The improvement grows with the number of PCI devices on a node, because
> with the patch their probes no longer contend for a single CPU.
>
> Fixes: ef0ff68351be ("driver core: Probe devices asynchronously instead of the driver")
> Cc: stable@vger.kernel.org
> Signed-off-by: Jinhui Guo <guojinhui.liam@bytedance.com>
> ---
> drivers/pci/pci-driver.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pci-driver.c b/drivers/pci/pci-driver.c
> index 7c2d9d596258..4bc47a84d330 100644
> --- a/drivers/pci/pci-driver.c
> +++ b/drivers/pci/pci-driver.c
> @@ -366,9 +366,11 @@ static int pci_call_probe(struct pci_driver *drv, struct pci_dev *dev,
> /*
> * Prevent nesting work_on_cpu() for the case where a Virtual Function
> * device is probed from work_on_cpu() of the Physical device.
> + * Check PF_WQ_WORKER to prevent invoking work_on_cpu() in an asynchronous
> + * probe worker when the driver allows asynchronous probing.
> */
> if (node < 0 || node >= MAX_NUMNODES || !node_online(node) ||
> - pci_physfn_is_probed(dev)) {
> + pci_physfn_is_probed(dev) || (current->flags & PF_WQ_WORKER)) {
> cpu = nr_cpu_ids;
> } else {
> cpumask_var_t wq_domain_mask;
> --
> 2.20.1